Christopher B. Browne's Home Page
cbbrowne@acm.org

Document Formats

Christopher Browne

$Id: document.sgml,v 1.51 2005-12-28 14:06:08 cbbrowne Exp $

Table of Contents
1. Document Presentation Languages
2. LaTeX and TeX
3. Postscript
4. What Does Chris Use?
5. eBooks

1. Document Presentation Languages

Document presentation languages are generally text-based languages that provide some form of "markup" system to describe how information should be formatted. They typically include both "structural" functions (where, for instance, one would indicate that a piece of text is a "section title," which implies that it should be presented in a bold/large font, perhaps with indexing information being automatically recorded), and "physical" functions (such as indicating font selection).
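The distinction between structural and physical markup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the role names, the style sheet, and the render function are all hypothetical and do not come from any of the systems described here.

```python
# Hypothetical style sheet: maps structural roles to physical formatting.
STYLE_SHEET = {
    "section-title": {"font": "Helvetica", "size": 18, "bold": True},
    "paragraph": {"font": "Times", "size": 11, "bold": False},
}


def render(role, text, toc):
    """Resolve a structural role into physical formatting instructions."""
    style = STYLE_SHEET[role]
    if role == "section-title":
        # Indexing information is recorded as a side effect of the structure.
        toc.append(text)
    weight = "bold" if style["bold"] else "regular"
    return f"[{style['font']} {style['size']}pt {weight}] {text}"


toc = []
lines = [
    render("section-title", "Document Presentation Languages", toc),
    render("paragraph", "Markup describes how information is formatted.", toc),
]
```

The author of the document only says "this is a section title"; the font choice and the table-of-contents entry both follow from that one structural statement.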

Common selections of "presentation languages" would include:

troff or nroff or groff

These languages have, of late, been used primarily for managing documentation for Unix utilities, and have a reputation for being somewhat arcane, particularly if one wishes to develop macros to extend the system.

O'Reilly Books uses groff as the typesetting "engine" for some of their books, which shows clearly that this method is capable of producing output that is of professional quality.

groff had not been under active development for quite some time; there is now a CVS server, and efforts continue at the GROFF Development Site.

The Open Road: Creating your own man pages

TeX

While working on the (still in progress) series The Art of Computer Programming, Donald E. Knuth, Professor Emeritus of The Art of Computer Programming, found he was extremely unhappy with the quality of typesetting provided by his publishers, and created his own font generation and typesetting system called TeX.

TeX's macro language is sometimes described (despite looking nothing like it) as akin to the LISP family of languages, and TeX encapsulates many of the "best practices" of modern typesetting. The output is commonly considered to be beautiful.

Unfortunately, TeX does not shield the user from knowledge of the way it sets text. The macro system used to extend TeX is not the simplest thing in the world to understand, and there are rather a lot of commands and macros to learn. Building documents in "raw" TeX is not for the faint of heart. TeX is perhaps best used as the underlying engine on top of which a "presentation language" is constructed.

TeX is supported by the TUG - TeX Users Group.

Various books are available:

LaTeX

Leslie Lamport created a set of macros on top of TeX, modeled on the classic SCRIBE markup system (which was also a direct progenitor of SGML), that provide infrastructure to implement a set of commonly used document styles such as letters, reports, and articles.

LaTeX provides a much more "friendly" face than TeX; it does more of the work of managing where things go on the page for you, and generally requires less "hacking around" to get the job done. It comes with additional tools such as:

BibTeX

For managing bibliographies

See also CL-BibTeX, a replacement for BibTeX written in Common Lisp. The aim is to allow formatting entries using CL programs rather than the stack language of BibTeX.

Perhaps the most interesting bit of this application is a compiler that transforms old style BibTeX style files into comprehensible CL programs.

makeindex

For managing book indexes

and provides an assortment of other such tools for managing the "lists" of things that commonly are attached to books and technical papers.

Many macro extensions are available to implement specialized document styles.

TeXinfo

TeXinfo is a format created for the GNU project about ten years ago for managing technical documentation. TeXinfo files can be converted into several forms:

LaTeX

So that attractive printed documentation can be generated

TeXinfo Index

Providing quick lookups of information via subject and other indices

TeXinfo Viewer

A hypertext form readable using an info viewer or within the text editor GNU Emacs.

The fact that TeXinfo integrates a hypertext form, printed documentation, and searchable text gives it some advantages over the (by most accounts more popular) HTML format that has spread over the last couple of years.

Latte - Language for Transforming Text

This language, licensed under the Zanshin Public License (ZPL), takes tagged text that looks a whole lot like LaTeX, and transforms it to HTML.

PMW (Philip's Music Writer)

Philip's Music Writer (PMW) is a computer program for high quality music typesetting.

SGML - Standard Generalized Markup Language

SGML, which has been an ISO standard for well over ten years now, is a "metalanguage;" a language in which one can write structural markup languages. The markup language specification is called a "DTD" or "Document Type Definition." Early adopters included the military and aerospace industries, where products require extensive amounts of highly technical documentation. Different applications will require different markup schemes; one might create a DTD describing a "letter" format; a number of DTDs exist to describe different kinds of books and reports. The Linux Documentation project has defined a DTD describing documents like this one.

SGML systems typically provide schemes for organizing information, searching it based on its structure, and validating that documents have been marked up correctly; most importantly, they make it easy to write programs that translate the document structure into another markup language. For instance, programs have been written to translate the markup used for this document into several formats including HTML, raw text, LaTeX, and TeXinfo.
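The idea of translating one structured source into several output formats can be sketched as follows. This is a minimal illustration, not the tool chain used for this document: it assumes XML as a stand-in for SGML (Python's standard library has no SGML parser), and the element names article, title, sect, heading, and para are invented for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical structural markup; the element names are made up.
SOURCE = """<article>
  <title>Document Formats</title>
  <sect><heading>TeX</heading><para>A typesetting system.</para></sect>
  <sect><heading>HTML</heading><para>A format for web pages.</para></sect>
</article>"""


def to_html(root):
    """Translate the document structure into HTML markup."""
    out = [f"<h1>{escape(root.findtext('title'))}</h1>"]
    for sect in root.findall("sect"):
        out.append(f"<h2>{escape(sect.findtext('heading'))}</h2>")
        out.append(f"<p>{escape(sect.findtext('para'))}</p>")
    return "\n".join(out)


def to_text(root):
    """Translate the same structure into numbered plain text."""
    title = root.findtext("title")
    out = [title, "=" * len(title)]
    for number, sect in enumerate(root.findall("sect"), start=1):
        out.append(f"{number}. {sect.findtext('heading')}")
        out.append(f"   {sect.findtext('para')}")
    return "\n".join(out)


root = ET.fromstring(SOURCE)
html_output = to_html(root)
text_output = to_text(root)

# Searching by structure: every section heading, regardless of presentation.
headings = [sect.findtext("heading") for sect in root.findall("sect")]
```

Because the source records only structure, each translator is free to decide how a heading or paragraph should look in its own target format.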

SGML tools have often tended to be expensive and complex to use, due to the intense needs of the early adopters. Not everyone is competent to create a DTD, or customize the "translation" from one markup scheme to another.

HTML

HTML is an example of a fairly simple SGML application: a format designed to represent "web pages."

It is a simple language, and provides enough functionality that it has been spectacularly popular. If you can write in some language (most likely English), and can type a little bit, you can probably create a web page.

The most popular "new feature" of word processing systems in 1997 was the ability to export HTML files.

Unfortunately for many users, HTML was not intended to provide many features in the area of specifying presentation. Netscape and Microsoft have "hacked" features into their web browsers to do more in this area, which has led to increasing incompatibility between web browsers.

XML

HTML is not nearly sophisticated enough for the things people want to do with documents and, more importantly, is not extensible.

SGML provides all the extensibility that anyone could ever desire, but is too sophisticated. With sophistication comes complexity, and SGML tools are indeed quite complex to use as one moves beyond highly-packaged end-user applications.

XML lies somewhere between HTML and SGML; it provides an extensible system for describing markup, which "beats out" HTML, but provides a simpler system than SGML, thus making it easier to build applications that use it.
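As a rough sketch of what "extensible" means here, the following Python fragment parses a document written in a made-up "letter" vocabulary using only the standard library. The element names (letter, sender, recipient, body, para) and the helper is_well_formed are assumptions for illustration; none of them is defined by XML itself.

```python
import xml.etree.ElementTree as ET

# A made-up vocabulary: none of these element names exist in HTML.
LETTER = """<letter>
  <sender>A. Writer</sender>
  <recipient>Reader</recipient>
  <body><para>First paragraph.</para><para>Second paragraph.</para></body>
</letter>"""


def is_well_formed(source):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(LETTER)
recipient = root.findtext("recipient")
paragraphs = [para.text for para in root.iter("para")]
```

A generic parser can read the invented tags without being told about them in advance, and its strict well-formedness rules (every opened element must be closed) are part of what keeps XML tools simpler than SGML tools. For example, is_well_formed("<letter><sender></letter>") returns False.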

Various companies appear to be putting substantial effort into XML, notably Microsoft, which claims that the next version of MS-Word will use XML as its "native format." XML applications like this may have the effect of popularizing it as a logical "upgrade" from the limitations of HTML.

Lout

Lout was targeted as a replacement for LaTeX. It is a typesetting system that produces Postscript as its output format; it uses a document structure syntax not dissimilar from predecessors such as SCRIBE and LaTeX.

The claim is that since Lout is recent code written from scratch, it avoids the "ugliness" inherent in the macro rewriting system used to implement TeX and LaTeX. It is equally arguable that it does not contain the typesetting practices embedded in TeX; whether its Algol-like programming model for customization is as powerful as the marginally more LISP-like macro programming model that TeX provides is also a matter of occasional controversy.

In any case, Lout lacks a sizable community of the kind that provides the huge body of enhancements and custom document styles available for TeX and LaTeX.

Skribe

A Scheme-based document language; you can transform Texinfo docs into Skribe, and transform Skribe docs into HTML, XML, LaTeX, Postscript, and PDF.

PDF

Adobe created PDF (Portable Document Format) as a multi-purpose document format. Their Postscript format was designed as a printer control language; it is not useful as a manipulable document format as (in general) it does not gather together information such as document structuring. Their Distiller application turns Postscript files into PDF files, thus regaining limited document structure information. If an Adobe "composer" application is used to generate the PDF file, then additional structure information can be added, such as indexing, hyperlinks, as well as the ability to indicate text flow if the page size is changed. The format is not readily editable "by hand;" documents commonly contain embedded fonts, for instance.

Adobe's Acrobat browser allows one to browse and print PDF documents, making use of whatever structuring information the document contains.

Adobe suggests that PDF files are superior to SGML as they can contain similar document structuring information, and permit the designer to use sophisticated graphical tools to build precisely designed attractive documents.

Unfortunately, since Adobe has long been the primary source of tools for manipulating PDF files, the "inferior" SGML-based HTML format has proven to be far more popular on the Web. PDF files are fairly commonly distributed for such things as:

  • Technology "White Papers"

  • Product brochures

  • Technical specifications

  • Electronic books

There is a significant and growing body of free software for generating and viewing PDF files, including quite a variety of tools for generating them.

Unfortunately, PDF seems poorly suited to building archives of documents that can be indexed and searched as a group, which limits its usefulness for general document processing.

PDF viewers include:
