DAGS95: Electronic Publishing and the Information Superhighway
May 30--June 2, 1995
Boston, Massachusetts

Digital Libraries and Large Text Documents on the World Wide Web

Harry Plantinga
University of Pittsburgh
Department of Computer Science
Pittsburgh, PA 15260
planting@cs.pitt.edu
http://ccel.wheaton.edu/~whp

Abstract

The World Wide Web (WWW) has strengths and weaknesses as a delivery vehicle for digital libraries. This paper discusses experiences with a small digital library on the WWW and describes some of the problems encountered. One problem in particular is addressed: that of the HTTP data delivery model, in which entire documents are transferred and displayed. This model is not ideal for large reference documents such as encyclopedias, dictionaries, and commentaries. This paper describes the approach taken to address this problem: paging large documents into smaller HTML documents while ensuring the validity of the returned HTML subdocument and minimizing the load on the server.

1. Introduction

The Christian Classics Ethereal Library (CCEL) is a small, experimental digital library on the World Wide Web (WWW) [1]. Its purpose is in part to experiment with electronic publishing and digital libraries on the WWW. It was started in May 1994, and as of February 1995 it had 26 HTML books and hundreds of other books and documents in text, HTML, RTF, PDF, and other formats. The access rate has been increasing by about 50% per month of late, reaching 70,000 for February 1995.

When people hear about the existence of a library on the WWW, they often make a comment along the lines of "Ugh, who would want to read a book on a computer screen?" I have sympathy for that point of view. The longest stretch of reading of a single book on a computer or PDA screen that I have managed is about an hour, on an Apple Newton PDA. But that is not the way this library is most commonly used. The most common use is accessing reference works, where only a small portion of text is needed and the searching and indexing capabilities of computers are most useful. These reference works include traditional reference books, such as a dictionary or commentary, and new reference works which contain links to other items on the Internet. Another common use is for browsing books, which can be downloaded and printed if they are of interest. Also, on-line libraries serve as the destination of hypertext links from other works.

The construction of this library has made evident many of the advantages and disadvantages of the WWW as a vehicle for digital libraries. Among its advantages, HTML represents the first widely used open standard for text mark-up; previously, most documents widely exchanged on the Internet were ASCII text. Hypertext links are useful in digital libraries for footnotes, hypertext tables of contents, and references to external documents. High-quality web browsers are freely available. The forms interface and CGI programs make it possible to do things that books have never before done.

One factor that many editors will consider a disadvantage is that they will have to give up much control over the appearance of the books they prepare, since HTML is essentially a structural markup language. Another disadvantage is that URLs specify locations on the Internet. It would be convenient to be able to have hypertext links that refer to a particular document rather than a particular location: local or regional copies, mirror sites, and backup servers would be useful for large documents located on a single server which may be heavily used. To address these needs, universal document identifiers such as the Universal Resource Numbers of HTML+ would be very helpful.

The transmission vehicle for the WWW is the Internet, and most people who have access to the Internet have it at the office or university in the form of a fast connection to a local area network which is connected to the Internet. This is a disadvantage in that books that are read via the WWW are usually read on the screens of desktop computers. Probably because this is unaesthetic if not injurious for long periods of reading, books are often just browsed or read for a short while. Of course, the transmission vehicle for the WWW may become a great advantage if wireless connections to the Internet become common. Then a small hand-held bookreading device such as an Apple Newton would be almost as usable for bookreading as a traditional book. Although traditional books will still have advantages such as the sharpness and contrast of the page, a bookreader on a wireless network will have other significant advantages, such as small physical size, hypertext links, and the free and instantaneous availability of thousands of classic books and reference works.

2. Large Documents on the WWW

However, the problem that I wish principally to address is that the hypertext model used by the WWW is not ideal for digital libraries. The model used is that of transmitting and displaying an entire document when a link is activated. Some browsers improve the model by displaying a partial document while it is being downloaded in the background. But even in that case, if a particular location in a document is referenced by a hypertext link, nothing can be displayed until the entire document to that point has been transmitted. This is inconvenient for large documents such as books, where it may take minutes to download the document before it is available for reading. But it is especially inconvenient for large reference works, which may constitute many megabytes of data, and of which only a very small section may be of immediate interest.

The method of dealing with this problem suggested in HTML writing guides [2] is to break up large documents into a number of smaller documents. But this approach degenerates to impracticality for large reference works. Imagine, for example, a dictionary with 20 M bytes of data and 100,000 entries. Should it be divided into 100,000 separate files? The allocated but unused disk space alone would be 1,580 M bytes on a file system with 16 K byte allocation units. It may also be that the file system of the server will prove inefficient at indexing and accessing such a large number of files. In addition, creation and maintenance of such a large number of files would be difficult and slow. Should the dictionary be stored with a smaller number of larger files? Much data will needlessly be transmitted across the Internet and users will have to wait longer than necessary for desired information to appear.
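The 1,580 M byte figure can be checked with a short calculation. The sketch below is only illustrative: it assumes that every entry fits within a single allocation unit and that 1 M byte is counted as 1,000 K bytes. The function name is hypothetical.

```python
def unused_allocated_mbytes(total_data_mbytes, num_files, unit_kbytes):
    """Allocated-but-unused disk space, in M bytes (1 M = 1,000 K here).

    Assumes each file occupies exactly one allocation unit, as a
    dictionary entry averaging about 200 bytes would.
    """
    allocated_mbytes = num_files * unit_kbytes / 1000
    return allocated_mbytes - total_data_mbytes


# 100,000 one-entry files on 16 K byte units, holding 20 M bytes of data.
print(unused_allocated_mbytes(20, 100_000, 16))  # 1580.0
```

In other words, the files would occupy 1,600 M bytes of disk, of which only 20 M bytes hold dictionary text.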

Then too, users often want to print out a large document such as a book, or perhaps an extended section of the document. If it is stored as a web of files, printing is problematic. One could maintain separate versions of the document, one for printing and one for the web, but that also has drawbacks.

The approach taken to address this problem in the CCEL is to store large reference works as one or a few large HTML files. A CGI program [3] is then used to select the desired section of the document and return only that section as an HTML document. I call the program that selects and returns small pieces of a large HTML document the "pager".

2.1 Subdocument Addressing

In order to return a subdocument it is first necessary to be able to specify the text extent of interest. The HTML 2.0 DTD draft [4] offers a name attribute for an anchor as a means of naming a section of text. For example,

<A NAME="Section1">This is section 1.</A>

Unfortunately, the text between the <A> and </A> tags is not supposed to contain anchors, and nested anchors give unpredictable results on some browsers. Therefore, sections containing anchors cannot be surrounded by <A> . . . </A> tags. In practice, named anchors are used to signify locations in a file rather than text extents. Thus, text extents must be specified with a beginning point and an ending point.

The HTML 3.0 DTD draft of 19-Jan-95 [5] changes the status of the NAME attribute for anchors to "deprecated." In its place, an ID attribute is supported for most elements. The ID attribute can be used in place of NAME to mark the required points in the text.
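As a rough illustration of specifying a text extent by a beginning point and an ending point, the following Python sketch locates named anchors with a regular expression and returns character offsets. The function name and the sample text are hypothetical. The pattern handles only quoted NAME values; it is a sketch under those assumptions, not a full HTML parser.

```python
import re

# Matches <A NAME="..."> start tags, case-insensitively, allowing spaces
# around "=". Unquoted attribute values are not handled.
ANCHOR = re.compile(r'<a\s+name\s*=\s*"([^"]*)"\s*>', re.IGNORECASE)


def find_extent(html, start_name, end_name=None):
    """Return (begin, end) character offsets of a text extent.

    The extent runs from the anchor named start_name up to the anchor
    named end_name, or to the next named anchor if end_name is None,
    or to the end of the text if no such anchor follows.
    """
    begin = None
    for match in ANCHOR.finditer(html):
        name = match.group(1)
        if begin is None:
            if name == start_name:
                begin = match.start()
        elif end_name is None or name == end_name:
            return begin, match.start()
    if begin is None:
        raise KeyError(start_name)
    return begin, len(html)


text = '<A NAME="s1"></A>One.<A NAME="s2"></A>Two.<A NAME="s3"></A>Three.'
begin, end = find_extent(text, "s1", "s3")
print(text[begin:end])
```

Here the ending anchor is treated as exclusive, so the extent from "s1" to "s3" contains the first two sections only.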

2.2 Returning a Section of a Large Document

The other need identified above is the ability to return a small section of an HTML document. However, just returning a specified section leads to a couple of problems. First, a section of a valid HTML document may not be valid HTML. For example, a tool for converting files in a word processor format (RTF) to HTML might convert a heading in the word processing file to HTML like this:

<h1>
<a name="RTFToC4">2.2
Returning a Section of a Large Document
</a></h1>

Suppose the anchor were used as the starting point of a section of the document. The start of the section returned to the user would be

<a name="RTFToC4">2.2
Returning a Section of a Large Document
</a></h1>

Notice that the <h1> start tag is missing, while the </h1> end tag is present. The returned section is not valid HTML, and furthermore it would result in wrong rendering by most browsers. In general, a section of a file returned in this manner would also lack the <HEAD> section of the document and any HTML container tags that were not terminated by the start of the section. Therefore it is in general necessary to parse the original file and return any required start tags, the selected extent of text, and any required end tags. It may also be desirable to prepend a header and other information to the text.
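One way to work out the required start and end tags is to track which container elements are open at the beginning and at the end of the selected extent. The Python sketch below does this with the standard library's html.parser module. It is an illustration only: the class and function names are hypothetical, it ignores elements whose end tags are optional (such as P and LI), and it does not prepend the document head.

```python
from html.parser import HTMLParser

# Elements that never take an end tag; they are not tracked as open.
EMPTY_ELEMENTS = {"br", "hr", "img", "link", "meta", "base", "isindex"}
# End tags for these are optional, so this sketch does not track them.
OPTIONAL_END = {"p", "li", "dt", "dd"}


class OpenTagTracker(HTMLParser):
    """Record which container elements remain open after parsing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # list of (tag name, original start-tag text)

    def handle_starttag(self, tag, attrs):
        if tag not in EMPTY_ELEMENTS and tag not in OPTIONAL_END:
            self.open_tags.append((tag, self.get_starttag_text()))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching start tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break


def open_tags_at(html, offset):
    """Return the elements still open at the given character offset."""
    tracker = OpenTagTracker()
    tracker.feed(html[:offset])
    tracker.close()
    return tracker.open_tags


def wrap_section(html, begin, end):
    """Return html[begin:end] with missing start and end tags supplied."""
    prefix = "".join(text for _, text in open_tags_at(html, begin))
    suffix = "".join("</%s>" % tag
                     for tag, _ in reversed(open_tags_at(html, end)))
    return prefix + html[begin:end] + suffix


doc = ('<body><h1>\n<a name="RTFToC4">2.2\nReturning a Section\n</a></h1>\n'
       '<p>Text.\n<a name="next"></a></body>')
begin = doc.index('<a name="RTFToC4">')
end = doc.index('<a name="next">')
print(wrap_section(doc, begin, end))
```

For the heading example, the missing <h1> start tag is restored ahead of the anchor, and the enclosing body element is closed at the end of the returned section.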

3. A Pager for Large HTML Documents

The design goal for the pager was that it return pages from documents stored as standard HTML rather than some other format. Therefore the HTML named anchor facility was used for identifying the beginning and end of sections of text. In essence, the idea behind the pager is simply that it returns the section of an HTML document between an <A NAME = "section_name"> tag and the next <A NAME = ""> tag, or the section between two specific named anchor tags, along with some header information.

Since the part of the document between these tags may not be valid HTML, the pager could parse the document to add any required tags, or a preprocessor could be used to modify the file so that sections between named anchors represent valid HTML. However, not all of this work may be necessary in practice if the HTML document is constructed in such a way that the named sections constitute valid HTML.

Another design problem to address concerns efficiency. It is clearly undesirable to return an entire encyclopedia when one article is requested, but it is also undesirable for a pager program running on a server to read sequentially through an entire encyclopedia to find a requested article. This problem may be alleviated by breaking up a large document into a few smaller documents. A better solution is to have the pager program automatically construct an index to a large HTML document the first time it is read and parsed, storing the starting and ending character positions and the tags to be prepended and appended for each named section. Later accesses are achieved by reading the index file and directly returning the part of the document requested without parsing. If file modification is detected, the index is rebuilt.

Many additional features that are useful for bookreading can be added onto this basic structure. Forward, backward, beginning, and index buttons would obviously be useful. Other features might include adjustable parameters for characteristics of performance such as the preferred number of pages to return at one time, whether to include footnote texts at the bottom of each page, whether to display a progress meter, and so on. Context information, such as the current page and user preferences, may be specified in the HTTP query:

pager.cgi?file=book.html&from=section_1&to=section_3&footnotes=false&meter=true

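A CGI program receives the part of the URL after the question mark in the QUERY_STRING environment variable. As a minimal sketch of how a query like the one above might be decoded, the following Python function uses the standard urllib.parse module. The function name and the default values for footnotes and meter are assumptions made for illustration, not documented behavior of the pager.

```python
from urllib.parse import parse_qs


def parse_pager_query(query):
    """Turn a pager query string into a dictionary of settings."""
    # parse_qs returns a list per key; keep the last value given.
    fields = {key: values[-1] for key, values in parse_qs(query).items()}
    return {
        "file": fields["file"],
        "from": fields.get("from"),
        "to": fields.get("to"),
        # The defaults below are assumptions, not the pager's own.
        "footnotes": fields.get("footnotes", "true") == "true",
        "meter": fields.get("meter", "false") == "true",
    }


query = "file=book.html&from=section_1&to=section_3&footnotes=false&meter=true"
print(parse_pager_query(query))
```
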
The prototype pager [6] returns the document head and the part of the body before the first named anchor. It does not parse or construct indexes, and the only additional features it currently provides are forward, backward, beginning, all, and up buttons and the ability to specify from and to section names. Nevertheless, it makes the use of digital books on the WWW much more efficient, and users say they love it. Formatting a book for use by the pager only involves inserting named anchors to delimit pages in such a way that paired start and end tags don't span them.
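The index described earlier in this section, which the prototype does not build, might look something like the following Python sketch. It is not the pager's implementation. The ".idx" file name, the JSON format, and the function names are all assumptions, and for brevity the sketch stores only byte offsets, omitting the tags to be prepended and appended for each section.

```python
import json
import os
import re

# Matches <A NAME="..."> start tags in the raw bytes of the document.
ANCHOR = re.compile(rb'<a\s+name\s*=\s*"([^"]*)"\s*>', re.IGNORECASE)


def build_index(path):
    """Scan the whole file once, recording where each named section lies."""
    with open(path, "rb") as f:
        data = f.read()
    matches = list(ANCHOR.finditer(data))
    index = {}
    for i, match in enumerate(matches):
        # A section ends where the next named anchor begins.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(data)
        index[match.group(1).decode("latin-1")] = [match.start(), end]
    return index


def load_index(path):
    """Return the section index, rebuilding it if the document changed."""
    index_path = path + ".idx"
    mtime = os.path.getmtime(path)
    if os.path.exists(index_path):
        with open(index_path) as f:
            saved = json.load(f)
        if saved["mtime"] == mtime:
            return saved["sections"]
    sections = build_index(path)
    with open(index_path, "w") as f:
        json.dump({"mtime": mtime, "sections": sections}, f)
    return sections


def read_section(path, name):
    """Return one named section by seeking to it, without a full scan."""
    begin, end = load_index(path)[name]
    with open(path, "rb") as f:
        f.seek(begin)
        return f.read(end - begin)
```

After the first request, each later request reads only the small index file and the bytes of the requested section, and a change in the document's modification time triggers a rebuild.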

4. Conclusion

The pager has made the use of the WWW for a digital library much more practical. Even slow Internet links are suitable for bookreading when small sections of a book can be accessed individually. Some of the particular benefits of the use of the pager are that large documents can be stored in one or a few files; small sections of a large document can be referenced and retrieved individually; it is not necessary to transmit an entire book up to a point in order to start reading it at that point; and HTML links can be made to locations inside a book without the concern that following the link will download an entire book.

The pager is especially useful for large reference documents such as dictionaries, where small sections of the text are desired. It is possible to construct a complete index file in such a way that following a link downloads only the article of interest. Page forward and backward buttons can be used to browse the dictionary. If the index file is large, it too may be paged, resulting in a two-level index.
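An index file of this kind could be generated mechanically from a list of entry names. The Python sketch below is one hypothetical way to do so: the pager.cgi URL form follows the query shown in section 3, while the function name, file name, and entries are invented for illustration.

```python
from html import escape
from urllib.parse import urlencode


def index_page(file_name, entries):
    """Build an HTML list whose links each fetch one section via the pager.

    entries is a list of (anchor_name, display_title) pairs.
    """
    lines = ["<UL>"]
    for anchor, title in entries:
        query = urlencode({"file": file_name, "from": anchor})
        # Escape "&" in the URL so the attribute value is valid HTML.
        lines.append('<LI><A HREF="pager.cgi?%s">%s</A>'
                     % (escape(query), escape(title)))
    lines.append("</UL>")
    return "\n".join(lines)


print(index_page("dict.html", [("aardvark", "Aardvark"), ("abacus", "Abacus")]))
```

Each link names only the starting anchor, so following it returns the text up to the next named anchor, that is, a single article.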

A disadvantage of using the pager is that named anchors must be inserted into the HTML document to delimit pages and an additional program must be run on the server for every page returned. Users are not eager to add named anchors to books by hand; a utility for normalizing and adding anchors would be a useful addition.

It is currently unaesthetic to read books on most computer screens in most cases. However, the situation is likely to change dramatically as hand-held computers suitable for bookreading become more common and more commonly connected to the Internet. Then it is likely that the large number of books and reference documents already available on the WWW will make its use for bookreading and digital library access grow even more dramatically.

I conclude with a plea to browser authors: support the HTML link tags for previous, next, and parent documents, e.g. <LINK HREF="docname" REL="next">. Support the navigation to those documents with left, right, and up arrow keys, and scrolling down and accessing the next document by pressing the space bar. Finally, offer an option of automatically pre-fetching the next page. Then nearly all network delays and mouse actions for remote bookreading would effectively be eliminated.

Notes

[1]
It is available at http://ccel.wheaton.edu/
[2]
CERN's HTML+ documentation, for example, states that "Keeping a large document such as a book in one node will increase the time it takes to retrieve the node over the network. It is generally better to split large documents into a number of smaller nodes" (http://info.cern.ch/hypertext/WWW/MarkUp/HTMLPlus/htmlplus_9.html).
[3]
CGI programs are programs written for the World Wide Web that generate and return documents on the fly in response to query strings.
[4]
The March 7, 1995 draft, available from HTTP://info.cern.ch/hypertext/WWW/MarkUp/html-spec/html.dtd
[5]
Available from http://info.cern.ch/hypertext/WWW/MarkUp/html3-dtd.txt
[6]
Available from http://ccel.wheaton.edu/pager.txt