The construction of this library has made evident many of the advantages and disadvantages of the WWW as a vehicle for digital libraries. One advantage is that HTML represents the first widely-used open standard for text mark-up -- previously most documents widely exchanged on the Internet were ASCII text. Hypertext links are useful in digital libraries for footnotes, hypertext tables of contents, and references to external documents. High-quality web browsers are freely available. The forms interface and CGI programs make it possible to do things that books have never before done.
One factor that many editors will consider a disadvantage is that they will have to give up much control over the appearance of the books they prepare, since HTML is essentially a structural markup language. Another disadvantage is that URLs specify locations on the Internet. It would be convenient to be able to have hypertext links that refer to a particular document rather than a particular location: local or regional copies, mirror sites, and backup servers would be useful for large documents located on a single server which may be heavily used. To address these needs, universal document identifiers such as the Universal Resource Numbers of HTML+ would be very helpful.
The transmission vehicle for the WWW is the Internet, and most people who have access to the Internet have it at the office or university in the form of a fast connection to a local area network which is connected to the Internet. This is a disadvantage in that books that are read via the WWW are usually read on the screens of desktop computers. Probably because this is unaesthetic if not injurious for long periods of reading, books are often just browsed or read for a short while. Of course, the transmission vehicle for the WWW may become a great advantage if wireless connections to the Internet become common. Then a small hand-held bookreading device such as an Apple Newton would be almost as usable for bookreading as a traditional book. Although traditional books will still have advantages such as the sharpness and contrast of the page, a bookreader on a wireless network will have other significant advantages, such as small physical size, hypertext links, and the free and instantaneous availability of thousands of classic books and reference works.
Then too, users often want to print out a large document such as a book, or perhaps an extended section of the document. If it is stored as a web of files, printing is problematic. One could maintain separate versions of the document, one for printing and one for the web, but that also has drawbacks.
The approach taken to address this problem in the CCEL is to store large reference works as one or a few large HTML files. A CGI program [3] is then used to select the desired section of the document and return only that section as an HTML document. I call the program that selects and returns small pieces of a large HTML document the "pager".
<A NAME="Section1">This is section 1.</A>
Unfortunately, the text between the <A> and </A> tags is not supposed to contain anchors, and nested anchors give unpredictable results on some browsers. Therefore, sections containing anchors cannot be surrounded by <A> . . . </A> tags. In practice, named anchors are used to signify locations in a file rather than text extents. Thus, text extents must be specified with a beginning point and an ending point.
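As a rough illustration of specifying an extent by a beginning point and an ending point, the following Python sketch locates two named anchors and returns the text between them. It is not the CCEL pager's code: the function names are hypothetical, and the pattern assumes that NAME is the first attribute of the anchor and is double-quoted.

```python
import re


def anchor_offset(html: str, name: str) -> int:
    """Return the character offset of <A NAME="name"> in html, or -1 if absent.

    Assumption: NAME is the first attribute and is written in double quotes.
    """
    pattern = re.compile(r'<a\s+name\s*=\s*"' + re.escape(name) + r'"',
                         re.IGNORECASE)
    match = pattern.search(html)
    return match.start() if match else -1


def extract_extent(html: str, start_name: str, end_name: str) -> str:
    """Return the text from anchor start_name up to, not including, end_name."""
    start = anchor_offset(html, start_name)
    if start < 0:
        raise KeyError(start_name)
    end = anchor_offset(html, end_name)
    if end < start:
        # End anchor missing or located before the start: run to end of file.
        end = len(html)
    return html[start:end]


# Empty anchors mark locations only; they do not enclose the section text.
document = (
    '<A NAME="Section1"></A><P>This is section 1.</P>\n'
    '<A NAME="Section2"></A><P>This is section 2.</P>\n'
    '<A NAME="Section3"></A><P>This is section 3.</P>\n'
)
section_2 = extract_extent(document, "Section2", "Section3")
```

Because the anchors are empty, a section may itself contain links without producing nested anchors.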
The HTML 3.0 DTD draft of 19-Jan-95 [5] changes the status of the NAME attribute for anchors to "deprecated." In its place, an ID attribute is supported for most elements. The ID attribute can be used in place of NAME to mark the required points in the text.
<h1> <a name="RTFToC4">2.2 Returning a Section of a Large Document </a></h1>
Suppose the anchor were used as the starting point of a section of the document. The start of the section returned to the user would be
<a name="RTFToC4">2.2 Returning a Section of a Large Document </a></h1>
Notice that the <h1> start tag is missing, while the </h1> end tag is present. The returned section is not valid HTML, and furthermore it would result in wrong rendering by most browsers. In general, a section of a file returned in this manner would also lack the <HEAD> section of the document and any HTML container tags that were not terminated by the start of the section. Therefore it is in general necessary to parse the original file and return any required start tags, the selected extent of text, and any required end tags. It may also be desirable to prepend a header and other information to the text.
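One way to work out the required start and end tags is to parse the file up to each end of the extent and note which container elements are still open there. The Python sketch below does this with the standard html.parser module. It is an illustration under simplifying assumptions, not the pager's implementation: the class and function names are hypothetical, elements whose end tags are commonly omitted (P, LI, DT, DD) are ignored rather than tracked, and no document head is prepended.

```python
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"br", "hr", "img", "meta", "link", "input", "base"}
# Assumption: end tags for these are often omitted, so they are not tracked.
OPTIONAL_END = {"p", "li", "dt", "dd"}


class OpenTagTracker(HTMLParser):
    """Records which container elements are open at the end of the input."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # (tag name, raw start-tag text), outermost first

    def handle_starttag(self, tag, attrs):
        if tag not in VOID and tag not in OPTIONAL_END:
            self.open_tags.append((tag, self.get_starttag_text()))

    def handle_endtag(self, tag):
        # Close the most recent matching element and anything nested in it.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break


def open_tags_at(html, offset):
    """Return the elements still open just before html[offset]."""
    tracker = OpenTagTracker()
    tracker.feed(html[:offset])
    tracker.close()
    return list(tracker.open_tags)


def wrap_extent(html, start, end):
    """Return html[start:end] with the missing start and end tags supplied."""
    prefix = "".join(text for _, text in open_tags_at(html, start))
    suffix = "".join("</%s>" % tag
                     for tag, _ in reversed(open_tags_at(html, end)))
    return prefix + html[start:end] + suffix


document = (
    '<html><body>\n'
    '<h1> <a name="RTFToC4">2.2 Returning a Section of a Large Document </a></h1>\n'
    '<ul><li>first item\n'
    '<a name="RTFToC5"></a><li>second item</ul>\n'
    '</body></html>\n'
)
start = document.index('<a name="RTFToC4">')
end = document.index('<a name="RTFToC5">')
page = wrap_extent(document, start, end)
```

Here the returned page begins with the restored <h1> start tag and ends with end tags for the list and the document body, which were still open where the extent stops.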
Since the part of the document between these tags may not be valid HTML, the pager could parse the document to add any required tags, or a preprocessor could be used to modify the file so that sections between named anchors represent valid HTML. However, not all of this work may be necessary in practice if the HTML document is constructed in such a way that the named sections constitute valid HTML.
Another design problem to address concerns efficiency. It is clearly undesirable to return an entire encyclopedia when one article is requested, but it is also undesirable for a pager program running on a server to read sequentially through an entire encyclopedia to find a requested article. This problem may be alleviated by breaking up a large document into a few smaller documents. A better solution is to have the pager program automatically construct an index to a large HTML document the first time it is read and parsed, storing the starting and ending character positions and the tags to be prepended and appended for each named section. Later accesses are achieved by reading the index file and directly returning the part of the document requested without parsing. If file modification is detected, the index is rebuilt.
Many additional features that are useful for bookreading can be added onto this basic structure. Forward, backward, beginning, and index buttons would obviously be useful. Other features might include adjustable parameters for characteristics of performance such as the preferred number of pages to return at one time, whether to include footnote texts at the bottom of each page, whether to display a progress meter, and so on. Context information, such as the current page and user preferences, may be specified in the HTTP query:
pager.cgi?file=book.html&from=section_1&to=section_3&footnotes=false&meter=true
The prototype pager [6] returns the document head and the part of the body before the first named anchor. It does not parse or construct indexes, and the only additional features it currently provides are forward, backward, beginning, all, and up buttons and the ability to specify from and to section names. Nevertheless, it makes the use of digital books on the WWW much more efficient, and users say they love it. Formatting a book for use by the pager only involves inserting named anchors to delimit pages in such a way that paired start and end tags don't span them.
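The index of section positions described earlier, which the prototype does not construct, might look something like the following Python sketch. The details are assumptions made for illustration: the index is kept as JSON in a file named after the document with a hypothetical ".idx" suffix, positions are byte offsets rather than character positions so that the file can be read with a seek, modification is detected by comparing the document's recorded modification time and size, and the tags to be prepended and appended are left out.

```python
import json
import os
import re

# Assumption: NAME is the first attribute of the anchor and is double-quoted.
ANCHOR = re.compile(rb'<a\s+name\s*=\s*"([^"]+)"', re.IGNORECASE)


def build_sections(data):
    """Map each anchor name to [start, end) byte offsets of its section."""
    matches = list(ANCHOR.finditer(data))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs to the next named anchor, or to the end of the file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(data)
        sections[match.group(1).decode("latin-1")] = [match.start(), end]
    return sections


def load_index(doc_path):
    """Return the section index, rebuilding it if the document has changed."""
    index_path = doc_path + ".idx"  # hypothetical naming convention
    stat = os.stat(doc_path)
    stamp = [stat.st_mtime_ns, stat.st_size]
    try:
        with open(index_path, encoding="utf-8") as f:
            index = json.load(f)
        if index["stamp"] == stamp:
            return index["sections"]
    except (OSError, ValueError, KeyError):
        pass  # no usable index yet; fall through and build one
    with open(doc_path, "rb") as f:
        sections = build_sections(f.read())
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump({"stamp": stamp, "sections": sections}, f)
    return sections


def fetch_section(doc_path, name):
    """Read one named section directly, without parsing the document."""
    start, end = load_index(doc_path)[name]
    with open(doc_path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

With an index in place, a request for one article costs a small index read and one seek, regardless of the size of the document.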
The pager is especially useful for large reference documents such as dictionaries, where small sections of the text are desired. It is possible to construct a complete index file in such a way that following a link downloads only the article of interest. Page forward and backward buttons can be used to browse the dictionary. If the index file is large, it too may be paged, resulting in a two-level index.
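An index file of this kind could be generated rather than written by hand. The Python sketch below builds a list of links, one per article, each pointing at the pager with the query parameters shown earlier. The function name and the sample dictionary entries are hypothetical, and the sketch assumes that the "to" parameter names the anchor at which the returned extent stops; the source does not say whether that bound is inclusive.

```python
from html import escape
from urllib.parse import urlencode


def index_page(file_name, entries, pager_url="pager.cgi"):
    """Build an HTML list linking each entry to its own page via the pager.

    entries is a list of (anchor_name, title) pairs in document order.
    """
    lines = ["<UL>"]
    for i, (name, title) in enumerate(entries):
        params = {"file": file_name, "from": name}
        if i + 1 < len(entries):
            # Assumption: the extent stops at the next entry's anchor.
            params["to"] = entries[i + 1][0]
        href = pager_url + "?" + urlencode(params)
        # Escape "&" so the URL is valid inside an HTML attribute value.
        lines.append('<LI><A HREF="%s">%s</A>' % (escape(href), escape(title)))
    lines.append("</UL>")
    return "\n".join(lines)


entries = [("aardvark", "aardvark"), ("abacus", "abacus")]
page = index_page("dictionary.html", entries)
```

If the generated index is itself given named anchors, the same pager can serve it in pieces, which yields the two-level index mentioned above.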
The pager has two disadvantages: named anchors must be inserted into the HTML document to delimit pages, and an additional program must be run on the server for every page returned. Users are not eager to add named anchors to books by hand; a utility for normalizing and adding anchors would be a useful addition.
It is currently unaesthetic to read books on most computer screens. However, the situation is likely to change dramatically as hand-held computers suitable for bookreading become more common and more commonly connected to the Internet. Then it is likely that the large number of books and reference documents already available on the WWW will make its use for bookreading and digital library access grow even more dramatically.
I conclude with a plea to browser authors: support the HTML link tags for previous, next, and parent documents, e.g. <LINK HREF="docname" REL="next">. Support the navigation to those documents with left, right, and up arrow keys, and scrolling down and accessing the next document by pressing the space bar. Finally, offer an option of automatically pre-fetching the next page. Then nearly all network delays and mouse actions for remote bookreading would effectively be eliminated.