Using HTML on the Commodore, Part 1

          by Jim Brain (brain@mail.msen.com) 

Note: Due to the recent relocation of myself and my family, I am behind on the development of the HTML viewer for the Commodore system. Therefore, this article will not focus on the actual viewer. With the development below 50% complete, the modules are subject to change. Describing them now would only confuse issues.

Introduction

HTML. This simplistic acronym, unknown to most people before 1993, now forms the heart of discussions. Its status is secured, as employers ask how much "experience" one has with it, and resumes commonly include it. A quick tally in any technical magazine reveals hundreds of references to it, and trips to the bookstore yield mountains of titles referring to it.

Most Commodore owners have a few questions about this acronym. First, what is it? Second, why should I care about it? In this series of articles, I will try to answer both questions to your satisfaction.

To answer the first question, let's step back to explain the World Wide Web (WWW). This explanation is not designed to replace more thorough treatments of the subject. In 1991, while working as a researcher at the CERN laboratory in Switzerland, Tim Berners-Lee developed a hypertext information retrieval system that allowed researchers at the lab to design informative online "presentations" of their work. In each presentation, a researcher could reference a document or presentation located elsewhere on the lab-wide network of computers. This reference was "live", meaning that a person could select it from the document and immediately view the referenced document. Thus, a matrix of related documents was created to interconnect the researchers' work.

In an effort to offer the researchers great latitude in presenting their works while retaining some standard in layout, Berners-Lee found simple ASCII text an inadequate presentation method. Clearly, a document formatting procedure, or "markup language" was needed. However, Berners-Lee found that popular document markup languages did not support the concept of referencing, or "linking" between documents in a standard and non-proprietary way. After looking past popular approaches like Windows help files, troff, TeX, and Rich Text Format, Berners-Lee found a standardized markup language that would support links and provide flexibility in creating documents, yet retain some semblance of commonality. The language was the Standard Generalized Markup Language (SGML).

SGML itself was derived from an IBM-specific markup language called Generalized Markup Language (GML). After some minor changes, the IBM GML specification became standardized. SGML, though, represents more than a simple formatting schema. SGML allows one to create multiple derived markup languages off the SGML base, and a suitable program can interpret each derived language independently. HTML is one such derivation of SGML.

Berners-Lee created the original specification for HTML while working on the WWW framework. Since mid-1993, when the first graphical HTML viewer arrived from the University of Illinois, the HTML specification has been revised and updated at least four times, but remains an SGML-derived language.

The Basics of HTML

HTML, like most formatting or document markup languages, allows the document creator to insert special labels, or "tags", into the document, which the language processor can parse. The language processor then converts these tags into the special formatting options they represent. In a simplistic markup language, one might place an asterisk "*" next to any word to be highlighted. As the "marked up" document is read and parsed by the language processor, the resulting output would highlight each word preceded by an asterisk. The asterisk itself would be stripped from the resulting display, as it does not form part of the document itself. In much the same way, HTML allows creators to insert HTML tags into the document being formatted. An HTML display system (commonly called an HTML viewer if the document is local or an HTML browser if the document can be accessed from a remote location) then parses the tags and renders the presentation of the document on a suitable display.
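
The simplistic asterisk language described above can be sketched in a few lines of Python. This is only an illustration, not part of the viewer: the function name render_asterisk_markup is made up for this example, and upper case stands in for whatever highlighting a real display would use.

```python
def render_asterisk_markup(source: str) -> str:
    """Highlight every word preceded by '*' and strip the asterisk itself."""
    rendered = []
    for word in source.split():
        if word.startswith("*") and len(word) > 1:
            # Assumption: upper case stands in for the display's highlighting.
            rendered.append(word[1:].upper())
        else:
            # Ordinary words (and a lone "*") pass through unchanged.
            rendered.append(word)
    return " ".join(rendered)


print(render_asterisk_markup("This is *very important text"))
# prints: This is VERY important text
```

The markup character never reaches the display; only its effect does. HTML tags behave the same way.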

Most HTML tags come in pairs. For most "open" tags, there is a corresponding "close" tag. All tags are simple ASCII words or letters preceded by a less-than "<" character, and followed by a greater-than ">" character. A simple tag is "HTML", which tells the browser that the document to follow is marked up in HTML. This tag takes the form:

<HTML>

Since tags are not case sensitive, <html> can be used as well. This tag is the HTML open tag, and it has a corresponding close tag. In HTML, a close tag is formed by inserting a slash "/" character after the less-than character and before the tag name. Thus, </HTML> would form the close HTML tag.

Some tags need additional information. This information is included after the tag name and before the greater-than character. Such tags include IMG, which instructs the HTML display system to load and display a graphics element at the present location. Since the location and name of the graphics element are needed, they are included as an "attribute" in the tag. To display a photo called jim.gif, I would include:

<IMG SRC=jim.gif>

in my document. Notice the space between the tag name and the attribute name. That space is necessary.

IMG has no corresponding close tag: since IMG doesn't turn something on that must be turned off, there is nothing to close. That forms the basis for using closing tags. Opening tags that "turn on" a formatting style require closing tags. Opening tags that do not "turn on" a formatting style, such as <br> and <hr>, either have no closing tag or allow it to be omitted. Of course, exceptions exist, but you'll rarely go wrong marking up with this rule in mind.
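
To make the tag syntax concrete, here is a minimal Python sketch that takes one tag apart. The function name parse_tag is hypothetical and is not one of the viewer's modules. The sketch also assumes that attribute values contain no spaces, which holds for SRC=jim.gif but not for every tag found in the wild.

```python
def parse_tag(tag: str):
    """Split a tag such as '<IMG SRC=jim.gif>' into (name, is_close, attributes)."""
    if not (tag.startswith("<") and tag.endswith(">")):
        raise ValueError("a tag must start with '<' and end with '>'")
    inner = tag[1:-1].strip()
    # A slash right after the less-than character marks a close tag.
    is_close = inner.startswith("/")
    if is_close:
        inner = inner[1:]
    parts = inner.split()
    if not parts:
        raise ValueError("a tag needs a name")
    name = parts[0].lower()  # tag names are not case sensitive
    attributes = {}
    # Assumption: attributes are space separated and values hold no spaces.
    for part in parts[1:]:
        key, _, value = part.partition("=")
        attributes[key.lower()] = value.strip('"')
    return name, is_close, attributes


print(parse_tag("<HTML>"))
print(parse_tag("</html>"))
print(parse_tag("<IMG SRC=jim.gif>"))
```

Folding the tag name to lower case once, at parse time, means the rest of a viewer never has to care whether the author typed <HTML> or <html>.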

The BASIC HTML Tags

The following tags are considered basic since they implement either the essential or often used formatting options available in HTML. Each opening tag is listed in its HTML form, and a description of the tag is given:

Tag       Description 

<html>    Begins an HTML document
<head>    Specifies the heading information (title, etc.)
<body>    Specifies the body of the document (information)
<p>       Inserts a paragraph
<hX>      Renders the following text in heading size X, where 1 <= X <= 6
          (H1 is largest, while H6 is smallest)
<br>      Inserts a line break
<title>   Specifies the title of the document
<hr>      Inserts a horizontal rule (line across document)
<strong>  Emphasizes text strongly (typically rendered as bold text)
<em>      Emphasizes text (typically rendered as italics)

Remember, these are but a few of the possible tags.

Creating an HTML Document

On the WWW, HTML documents are referred to as "pages", and each page is constructed as a simple ASCII or ISO 8859-1 (superset of ASCII) text file. No preprocessing is necessary. This makes creating documents as easy as editing a text document. HTML files are typically given the file extension ".html", and IBM PC computers running MS-DOS typically shorten this to ".htm" because DOS limits extensions to three characters. However, the former extension is the more correct one. Although fancy HTML generation applications exist, most people on all platforms simply create pages using a text editor. Since Commodore owners can usually find a text editor, Commodore enthusiasts can create pages just as easily as anyone. Additionally, the WWW and HTML encourage writers to create small pages, and break up large documents into linked pages of smaller sizes. Typically, HTML documents are less than 10 kilobytes in length. At that size, even an expanded VIC-20 can create full size HTML pages.

Let's create our first document. Edit a file called template.html and place the following text inside it:

<html> 
<head> 
<title>This is an HTML title</title> 
</head> 
<body> 
<h1>This is an example of Heading 1</h1> 
This is a paragraph. 
<p> 
This is another paragraph. 
<br>
I want you to see this next sentence.  <strong>Therefore, I am strongly 
emphasizing it</strong>. 
Now we are back to normal.
This sentence is below the last in the source, but will appear following 
it when displayed. 
</body> 
</html> 

Notice which tags require closing. Also, notice how <HEAD> and <BODY> are used in the document. Notice the two final sentences in the above example. The sentences appear on different lines in the document, but HTML specifies that all carriage returns will be translated into spaces. It further specifies that if multiple spaces exist in a file, they will be reduced to a single space. Thus, using spaces as alignment aids will not work in HTML. Likewise, using linefeeds and carriage returns to specify alignment will also fail. If a new line is necessary, use <p>, which will leave a blank line, or <br>, which starts a new line.
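
The two whitespace rules just described can be sketched in Python as follows. The function name collapse_whitespace is made up for this example. Note one simplification: the sketch also drops whitespace at the very start and end of the text, which the rules above do not require.

```python
def collapse_whitespace(text: str) -> str:
    """Turn line ends into spaces and reduce runs of spaces to one space."""
    # str.split() with no argument splits on any run of spaces, tabs,
    # carriage returns, or linefeeds, so joining the pieces with a single
    # space applies both rules at once.
    return " ".join(text.split())


source = "Now we are back to normal.\nThis sentence   is below the last."
print(collapse_whitespace(source))
# prints: Now we are back to normal. This sentence is below the last.
```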

What's in it for Commodore Enthusiasts?

This is an interesting question, and I hope you agree with my answer. Many claim that HTML is useless to the Commodore owner since the Commodore can't display HTML. While I am not even sure that is true (I've heard of simple HTML viewer programs for the 128), it doesn't matter. Commodore owners who access the Internet from a "shell" account can access the World Wide Web via the "Lynx" text browser. Since the WWW is constructed of HTML pages, those Commodore owners can indeed view HTML files while online. Many Commodore enthusiasts possess useful information. Putting that information on the Internet via HTML and WWW makes it widely available to other Commodore and non-Commodore computer owners. Why worry about the latter? You'd be surprised how many former Commodore owners are coming back into the fold after viewing some Commodore HTML pages. The information on those pages triggers fond memories. Many fire off messages inquiring about purchasing a new or used CBM machine after seeing these pages.

To the naysayers, I submit that there is nothing PC-centric in the HTML standard. If an HTML viewer doesn't yet exist, it has nothing to do with the computer system. As HTML was created to allow successful operation over many different computer systems and graphics capabilities, HTML encourages usage on computer systems like the Commodore, where there are limitations in display size and resolution.

In fact, the Commodore community should embrace HTML as a markup language, for it represents a standard way to effectively mark up documentation for viewing on a variety of computer systems. Using HTML opens up a whole set of possibilities for easily created, standardized documentation publication.

Disk magazines, like _LOADSTAR_, _DRIVEN_, _VISION_, and _COMMODORE CEE_, could produce issues that contain more layout information than now offered. Since the viewer would now be standardized, these publications could possibly forego the distribution of the viewer software and offer more content in the extra space on disk. A side benefit is the ability for Commodore users to read each issue on any platform. Possibly you'll never need to read LOADSTAR 128 Quarterly on an IBM PC, but what about reading it on a 64, while your sole 128 does something else? Moving to HTML would shift a disk magazine's focus and concern from the presentation, which would become standard, to content, which is why Commodore owners read such magazines anyway. How many times has otherwise great information been presented badly in a disk magazine? Use of HTML could help alleviate that problem. Publishing a disk magazine is time consuming because not only must editors work on the articles themselves, they must also write the software that presents the articles to the viewer. Using HTML and a pre-written browser would allow editors to spend more time on laying out and editing articles.

Disk magazines aren't the only winners here. Have you ever wanted to create a small publication? The use of HTML and a third-party HTML viewer makes it easy for you to do so. Just as it does for the editors of bigger publications, HTML allows you to concentrate on presenting your information without worrying about writing the presenter software. Now, obviously not everyone should publish their own magazine, but how about help files, information disks, software documentation, club newsletters, etc.? These publications can all benefit from this technology.

These are but a few of the benefits of switching to HTML for document layout. Other uses include upward-compatible support. Using HTML allows the Commodore 128 user to view documents created for the 64 in 80 columns by 50 rows. C128D owners can take advantage of their 64kB video RAM even when viewing documents created on 16kB video RAM C128s. Publishers would no longer be constrained by lowest common denominator support. They can now include whatever they want and be assured that the presentation will look fine on all platforms. When users upgrade their machines, they can immediately utilize the new features without requesting a new version of the publication. Also, for software, even though the software itself might differ by machine, the online documentation need be written only once. As well, never forget that marking up in HTML makes migrating your documents to the Internet and the WWW a snap!

Creating an HTML viewer on the Commodore

Obviously, before Commodore users can reap the benefits of HTML, we must create both an HTML generator and a viewer. The generator is easy, as HTML documents are simply ASCII text files. So, we are left to design and implement an HTML viewer. The following conditions should be met:

At first, we're going to concentrate on developing our viewer for the Commodore 64, although we should strive to offer versions for the 128, C65, Plus/4, C16, B series, PET, and VIC-20. I am reasonably confident on all but the last one.

Although we intend to develop a viewer that supports the above, our initial development will operate on a much smaller scale. The first revision of this viewer will operate on the stock machine and will contain support for the basic HTML tags as outlined above. Our design will allow us to extend the capabilities to encompass our goals.

The Viewer Execution Flow

I am not very good at drawing execution flows, and the native format of this magazine doesn't lend itself well to them, anyway. Therefore, I will simply describe the execution flow.

The viewer will start by asking the user for a document to access. If the file does not exist, an error is printed and the user is asked again. If the file exists, the viewer will begin reading it. If a tag is found, the tag should be acted upon. If text is found, it should be displayed on the screen using the current markup controls unless the control information is incomplete. In this case, the text should be stored for later display. The file should be parsed in this way until the end is found. Then, the system will wait for the user either to select a link or to type in the name of a new document to view.
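
As a rough illustration of the parse-and-display part of this flow, here is a Python sketch. It is not the viewer's actual design: the names tokenize and render are invented for this example, only <p>, <br>, <hr>, and <title> are acted upon, and prompting for a file name and following links are left out entirely. The 40-column rule assumes a stock 64 screen.

```python
import re

# A token is either a complete tag or a run of text up to the next tag.
TAG_OR_TEXT = re.compile(r"<[^>]*>|[^<]+")


def tokenize(document):
    """Yield ('tag', name) or ('text', content) pairs in document order."""
    for match in TAG_OR_TEXT.finditer(document):
        piece = match.group(0)
        if piece.startswith("<"):
            yield "tag", piece[1:-1].strip().lower()
        else:
            yield "text", piece


def render(document):
    """Return display lines; every tag not handled here is simply ignored."""
    lines = []
    pending = []      # text collected since the last line break
    in_title = False  # title text is not part of the page body

    def flush():
        # Join the collected text and apply the whitespace rules to it.
        text = " ".join("".join(pending).split())
        pending.clear()
        if text:
            lines.append(text)

    for kind, value in tokenize(document):
        if kind == "text":
            if not in_title:
                pending.append(value)
        elif value == "title":
            in_title = True
        elif value == "/title":
            in_title = False
        elif value == "br":
            flush()
        elif value == "p":
            flush()
            lines.append("")        # a paragraph leaves a blank line
        elif value == "hr":
            flush()
            lines.append("-" * 40)  # assumption: a 40-column screen
    flush()
    return lines


page = "<html><body>One\ntwo<br>I am <strong>sure</strong>.<p>four</body></html>"
for line in render(page):
    print(line)
```

A real viewer would act on <strong>, the headings, and the rest instead of skipping them, but the loop keeps the same shape: read a token, then either change state or emit text.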

Most of the time, text can be displayed as soon as it is received. However, there are exceptions. Some tags, like the <TABLE> tag, which creates a table on the screen, require that all the data in the table be known before the table cell information can be calculated. In cases like these, we must store the data and wait for the </table> tag.
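
To see why a table must be buffered, consider this Python sketch. The function name layout_table and the " | " separator are assumptions made for illustration only. The width of each column depends on the longest cell in that column, so no row can be drawn until the last row has been read.

```python
def layout_table(rows):
    """Format buffered table rows once every cell is known."""
    if not rows:
        return []
    column_count = max(len(row) for row in rows)
    widths = [0] * column_count
    # First pass: a column is as wide as its longest cell, so every
    # row must be examined before any row can be displayed.
    for row in rows:
        for index, cell in enumerate(row):
            widths[index] = max(widths[index], len(cell))
    # Second pass: pad each cell to its column width and emit the row.
    lines = []
    for row in rows:
        padded = [cell.ljust(widths[index]) for index, cell in enumerate(row)]
        lines.append(" | ".join(padded).rstrip())
    return lines


buffered = [["Tag", "Description"], ["<br>", "Line break"]]
for line in layout_table(buffered):
    print(line)
```

In a viewer, the list passed in would be filled while the cells are parsed and handed to a routine like this only when the closing table tag arrives.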

The above flow explanation ignores some subtleties like carriage return stripping and multiple space reduction. Those are left out because at least one tag, the <PRE> tag (preformatted text), overrides those rules. <PRE> text is displayed in a monospaced font exactly as it is prepared in the document. Text is not wrapped, and spaces are not reduced. So, we will treat those rules as formatting options that are normally turned on and that a tag like <PRE> can turn off.
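
One way to picture whitespace handling as an option that is normally on is the Python sketch below. The function name render_text and the single collapse flag are assumptions made for this example, and all tags other than <PRE> are ignored.

```python
import re


def render_text(document):
    """Collapse whitespace everywhere except between <pre> and </pre>."""
    collapse = True  # the normal rule: on unless a tag turns it off
    output = []
    for piece in re.findall(r"<[^>]*>|[^<]+", document):
        if piece.startswith("<"):
            name = piece[1:-1].strip().lower()
            if name == "pre":
                collapse = False
            elif name == "/pre":
                collapse = True
            # every other tag is ignored in this sketch
        elif collapse:
            # Line ends become spaces and runs of spaces shrink to one.
            output.append(re.sub(r"\s+", " ", piece))
        else:
            # Preformatted text passes through exactly as written.
            output.append(piece)
    return "".join(output)


print(render_text("one   two\n<pre>a  b\n  c</pre>\nthree"))
```

Because the rule is just a flag, any later tag that needs to change whitespace handling can do so without touching the text-handling code.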

Conclusion

I regret that we haven't gotten very far in the development process with this installment, but we'll make up for lost time in the next installment. One thing that I would like to encourage from readers is comments and suggestions. Do you see a problem with some of the above information? Do you have a better way to parse some of the information? Do you see limitations in the data structures? Since we haven't delved into some of these aspects yet, do you have some ideas of your own? I can guarantee that I'm ready to discuss them with you; however, I can't read your mind. I think it's important that this project be completed, as it forms the core of a successful WWW browser, and I see everyone wanting to know when one will be available. I am less concerned that my name appear on the finished product. In fact, I think a product that draws on the talent of the entire Commodore community would most likely exceed the quality a single individual can afford a piece of software. So, fire up those assemblers and put on those thinking caps.


Copyright © 1992 - 1997 Commodore Hacking

Commodore, CBM, its respective computer system names, and the CBM logo are either registered trademarks or trademarks of ESCOM GmbH or VISCorp in the United States and/or other countries. Neither ESCOM nor VISCorp endorse or are affiliated with Commodore Hacking.

Commodore Hacking is published by:

Brain Innovations, Inc.
10710 Bruhn Avenue
Bennington, NE 68007

Last Updated: 1997-03-11 by Jim Brain