SGML:

What It Is & Why It's Good for Braille (Though we still need UBC)
by Joseph E. Sullivan

Member, International Committee on Accessible Document Design
Member, Texas Commission on Braille Textbook Production
Chairman, Committee II of BANA Unified Braille Code Research Project
President, Duxbury Systems, Inc.

March 9, 1993

This document is hereby placed in the public domain.

Introduction

Many of today's technologies seem to evolve so fast that there's no keeping up with them. For instance, practically all of the latest and greatest computers of the 1970's are no longer seen outside museums and reruns of the science fiction movies of that era. Many computers of the early 1980's are now serving as bookends. Even last year's computers suffer noticeably by comparison with this year's, and so it goes. Compared with computers, braille -- invented in the early nineteenth century -- would seem to be a model of solid and mature technology, one that does not and probably should not evolve very rapidly. While that is basically correct, braille too can profit from the inevitable march of progress, and at the present time there are two technologies being developed that are, in my opinion, of particular importance to the entire community of persons who produce or use braille.

Those technologies are SGML (Standard Generalized Markup Language) and UBC (Unified Braille Code). Once you get past the alphabet soup, the concepts behind these technologies are not really difficult, even though they are a bit "technical" if you get down to some of the details. This paper concentrates primarily on SGML, attempting to explain it and its benefits for braille, in nontechnical terms, to the layman who may not know much about either markup languages or braille. It goes on to explain why certain problems with braille are not solved by SGML, and it touches on UBC as a promising complementary technology. For those readers who are already well versed in one or another aspect of these subjects, it will be obvious (I hope) which paragraphs may be skipped over.

What is SGML?

Since the 1960's, one of the more routine uses of computers has been to format text, that is, to produce "finished" documents from text stored as a file on the computer. To make this a bit more concrete, let us assume that we wish to produce a simple document that has several sections, each section having a heading and one or more ordinary paragraphs. Let us further assume that, in the print edition, we want the section headings to be centered and the paragraphs to be separated from each other, and from the headings, by a skipped line. Then we might start with a computer file that contains the following text:

  [start-centering] First Heading [end-centering] [skip-line] This
  is the text of the first paragraph, which would normally be much
  longer but we do not wish to belabor this example.  [skip-line]
  This is the text of a second paragraph.  Again we will be
  unusually brief.  [skip-line] [start-centering] Second, Somewhat
  Longer Heading [end-centering] [skip-line] This is the first
  paragraph in the second section...

The items enclosed in brackets in the above example are obviously not part of the text that is intended for the eventual human reader, but rather "formatting codes" to be used by the computer program that is to do the formatting. The formatting program uses them to tell what text to lay down in ordinary paragraphs, what text to center, and where to skip lines. Those codes, and the text itself, are all that is important in this file. In particular, line endings in the original file signify nothing more than a break between words, and so are exactly equivalent to spaces; to emphasize this, we have deliberately shown the file as running on in a kind of stream that is a bit hard to read (sorry about that).

Starting with the above file, the output of the formatting program would be something like this:


                          First Heading

  This is the text of the first paragraph, which would normally
  be much longer but we do not wish to belabor this example.

  This is the text of a second paragraph.  Again we will be
  unusually brief.

                 Second, Somewhat Longer Heading

  This is the first paragraph in the second section...
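
For readers who like to see such things spelled out, here is a minimal sketch of a formatting program of this kind, written in Python. It is only an illustration: the function name format_appearance and the line width of 64 are assumptions made for this example, and only the three bracketed codes used above are recognized.

```python
import re
import textwrap

WIDTH = 64  # assumed line width; the paper does not specify one


def format_appearance(source, width=WIDTH):
    """Render a stream of text and bracketed format codes as plain lines."""
    # Split the stream into bracketed codes and the runs of text between them.
    tokens = re.findall(r"\[[^\]]+\]|[^\[]+", source)
    lines = []
    centering = False
    for token in tokens:
        if token == "[start-centering]":
            centering = True
        elif token == "[end-centering]":
            centering = False
        elif token == "[skip-line]":
            lines.append("")
        elif token.startswith("["):
            raise ValueError(f"unknown format code: {token}")
        else:
            # Line endings in the source count only as spaces.
            text = " ".join(token.split())
            if not text:
                continue
            for line in textwrap.wrap(text, width):
                lines.append(line.center(width).rstrip() if centering else line)
    return "\n".join(lines)
```

Passing the coded file shown earlier to format_appearance yields a centered heading, a blank line, and then the paragraphs re-wrapped to the chosen width, regardless of where the line endings fell in the original file.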

"Codes" may also be called "tags" or "markup". In the example given, the codes are obviously related directly to the appearance, or format, of the resulting document, and so would be called format codes or appearance-oriented markup. This general kind of markup is no doubt familiar to most users of word-processing software. In WordPerfect, for example, you need only press the "Reveal Codes" function to see markup much like our example, although the actual codes are different. There are a great many computerized formatting systems based on this type of coding, some of them designed to accommodate braille formatting as well as print, and much useful work has been and continues to be done by them. Even if you do not use such computerized formatting systems, if you have had occasion to re-type several pages of a paper just because of a few editorial changes, you can appreciate how much labor is saved by having the computer re-do the tedious page formatting after you have made just the essential text changes in the original file.

Superficially, SGML codes are a lot like format codes, but they have important additional advantages. Again to keep this in the concrete, let us re-code our original example SGML-style:

  [start-heading] First Heading [end-heading] [start-paragraph] This
  is the text of the first paragraph, which would normally be much
  longer but we do not wish to belabor this example. [end-paragraph]
  [start-paragraph] This is the text of a second paragraph.  Again
  we will be unusually brief.  [end-paragraph] [start-heading]
  Second, Somewhat Longer Heading [end-heading] [start-paragraph] This
  is the first paragraph in the second section... [end-paragraph]

The first thing we notice is that now the codes no longer refer to skipped lines, centering and such matters of appearance but rather to the CONTENT of the material, that is, the nature of the text in the logical organization of the document. This corresponds to the way that the original author thinks about the document. That author may not know and may not care what formatting devices, such as line skips and centering, may eventually be used to format the document for presentation, but he or she (or an allied specialist) can still enter these content-oriented codes because they correspond to natural divisions of the material itself. When it comes time to publish, a person whose speciality is document appearance can decide how to turn these codes into formatting rules. There are various ways of doing this. One way might be to specify what the SGML tags "mean" in terms of formatting tags, for example:

  [start-heading] means: [skip-line] [start-centering]
  [end-heading] means: [end-centering] [skip-line]
  [start-paragraph] means: [skip-line]
  [end-paragraph] means: [skip-line]

These rules aren't quite right, because if they are taken too literally then, for example, we would wind up with two skipped lines between paragraphs. But we did promise to stay clear of technical details; you get the idea.
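
For the curious, one way around the doubled line skips can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how any actual SGML system works: the function name to_format_codes is invented here, and the refinements (never emit two skips in a row, and drop a skip at the very beginning or end) are one plausible choice among several.

```python
import re

# The four rules from the text: each content tag "means" some format codes.
RULES = {
    "[start-heading]": ["[skip-line]", "[start-centering]"],
    "[end-heading]": ["[end-centering]", "[skip-line]"],
    "[start-paragraph]": ["[skip-line]"],
    "[end-paragraph]": ["[skip-line]"],
}


def to_format_codes(source):
    """Translate a content-coded stream into an appearance-coded stream."""
    out = []
    for token in re.findall(r"\[[^\]]+\]|[^\[]+", source):
        if token.startswith("["):
            if token not in RULES:
                raise ValueError(f"tag not covered by the rules: {token}")
            for code in RULES[token]:
                # Assumed refinement: never emit two skips in a row,
                # and never begin the document with one.
                if code == "[skip-line]" and (not out or out[-1] == code):
                    continue
                out.append(code)
        else:
            text = " ".join(token.split())
            if text:
                out.append(text)
    # Assumed refinement: a trailing skip serves no purpose either.
    if out and out[-1] == "[skip-line]":
        out.pop()
    return " ".join(out)
```

Applied to the SGML-style coding above, this produces a stream in the style of the first, appearance-oriented coding, with a single skipped line between adjacent paragraphs.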

With respect to the eventual formatting, then, SGML coding may be said to be "indirect". That is, you can't really tell, from the original file alone, what the document will look like, but must reference the rules in order to figure that out. This is sometimes a disadvantage, but in many important situations it is more than offset by advantages that also derive from this indirectness. One is that, as we have seen, the text coding on the one hand, and the drawing up of the formatting rules on the other, can be done by separate people each of whom is especially appropriate for the particular task. Another is that it remains possible for each to work independently of the other. The author may change the text to improve or update the content, while the document designer may alter the document formatting rules, yet neither gets in the other's way. An important corollary is that eventually there may come to be multiple sets of formatting rules, each for a particular type of document, such as a magazine article set in columns, a hard-cover book, and a paperback -- some of which may not even have been contemplated at the time of original composition.

That's why SGML is "generalized" markup: it does not specifically determine the format but rather can work with many different format styles.

People who work regularly with word processors may recognize a closely related concept here, namely the notion of a "style", as they are now usually called. (Especially in older word processors, the term "macro" may be used with much the same meaning.) Basically, defining a style is equivalent to giving a name to a collection of formatting codes. After defining a number of styles, the codes that you use thereafter in the text may consist only of the style names, with no further need to reference the basic formatting codes directly. If, for example, the styles defined had the same names as our four SGML tags, and were equated to sets of format codes similarly to the four rules listed above, then the coding of the file could be entirely indistinguishable from SGML, and have the same quality of indirectness and consequently the other, mostly beneficial, qualities that derive from indirectness.

Thus styles, consistently used, can have many of the same benefits of SGML. However, with styles there is generally no mechanism to enforce consistent or logical use. Thus, for example, nothing would prevent using a "start-paragraph" tag between a "start-heading" tag and its corresponding "end-heading" tag. It is chiefly on this point that SGML goes beyond styles. With SGML, the set of tags that can be used, and the ways that they can be used, are precisely defined in a separate file called a Document Type Definition (DTD). For example, the DTD defined for our simple case would not only list the four desired tags, but would undoubtedly also specify that paragraphs may not occur within headings, nor vice versa. Thus the user of an SGML system is spared the possibility of making illogical coding errors. Consequently, from the perspective of the DTD designer, it is possible by means of the DTD to enforce a required "structure" on the document. That is, he or she may specify, for example, the order in which various kinds of headings can appear, whether they are mandatory or not, and many other matters that ultimately determine what authors may do when writing documents governed by that DTD.
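
The kind of checking that a DTD makes possible can be suggested by a small Python sketch. It is emphatically not real DTD syntax or a real SGML parser: the table ALLOWED_INSIDE, the top-level name "document", and the function check_structure are all hypothetical stand-ins, covering only the rule that paragraphs and headings may not occur within one another.

```python
import re

# Hypothetical stand-in for a DTD: which elements may appear inside which.
# "document" is an assumed name for the top level.
ALLOWED_INSIDE = {
    "document": {"heading", "paragraph"},
    "heading": set(),    # no paragraphs (or headings) within headings
    "paragraph": set(),  # and no headings within paragraphs
}


def check_structure(source):
    """Return a list of structural errors; an empty list means valid."""
    errors = []
    stack = ["document"]
    for kind, name in re.findall(r"\[(start|end)-([a-z]+)\]", source):
        if name not in ALLOWED_INSIDE or name == "document":
            errors.append(f"tag not defined: {kind}-{name}")
        elif kind == "start":
            if name not in ALLOWED_INSIDE[stack[-1]]:
                errors.append(f"{name} may not occur within {stack[-1]}")
            stack.append(name)
        elif stack[-1] == name:
            stack.pop()
        else:
            errors.append(f"end-{name} does not match open {stack[-1]}")
    if len(stack) > 1:
        errors.append(f"unclosed element: {stack[-1]}")
    return errors
```

A file in which a "start-paragraph" tag appears between a "start-heading" tag and its "end-heading" tag would be reported as an error by such a check, whereas a system based on styles alone would accept it silently.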

With respect to what can be in an SGML document, then, we may say that it has an "enforced structure". As with indirectness, this quality can sometimes be a drawback, because you can't simply introduce a new tag into the text file, any more than you can directly specify an imaginative new format. Rather, you must first define the tag (or more properly the start-end tag pair, enclosing a "text element") in the DTD, and also specify how it must be used; only then may you use the tag in a document associated with that DTD. This may be viewed as a loss of flexibility or at least spontaneity, but it is a positive gain in many cases where it is necessary to ensure uniformity over large classes of documents. Examples of such cases would be military procedure manuals and the software user's guides within a series for a computer system. It is obvious, then, why SGML is most popular in those organizations that must regularly issue such documents.

Is SGML a new idea?

As computer-related technologies go, SGML is rather old. Deriving from work by William Tunnicliffe, Charles Goldfarb and others in the late 1960's and through the 1970's, it has become both a standard of the International Organization for Standardization (ISO 8879) and the technology behind several commercial products.

Today, though, SGML is still not widely used for general document production. Rather, its use remains largely tied to those situations where enforced structure is highly desirable. Most documents are produced in circumstances where there is less need for formality, where the task at hand is primarily to produce one specific edition, and where it may even be desirable to use innovative format techniques to enhance document content through visual aesthetics and interest. These characteristics run counter to SGML's properties of indirectness and enforced structure. Consequently, it is not surprising that the great bulk of document production still takes place using systems that are appearance-oriented. All of the most popular word processors used in business offices, such as WordPerfect, Microsoft Word and Ami Pro, and all of the most popular page composition programs favored by professional publishers, such as CorelDraw and PageMaker, are of that kind.

Despite the fact that appearance-oriented methods still dominate after all these years, SGML has remained an important technology, and more to the point now appears to be gaining in importance as its strengths are more recognized and needed. As always, things are changing. Increasingly, publication in a specific paper format is not the end of the story for a typical document. Instead, it is now more likely that the information will be retained in a database to allow for electronic reference, and also that it will be retained for possible publication in alternative formats, including those specifically adapted for persons with disabilities. Information in a database is more useful when it is structured, so that for example you can retrieve just those documents where a particular person is listed as author, without retrieving those where that person is otherwise mentioned. As we have seen, both this kind of structure and the indirectness that favors alternative formats are defining strengths of SGML. Consequently, there seems to be increasing interest in SGML, with attendant progress in technology, especially along the lines of combining SGML with appearance-oriented methods so that it is easier to experience the best of both. For example, SoftQuad's Author/Editor, which is firmly SGML-based, nevertheless allows viewing and working with the document in formatted form on screen, as is typical of appearance-oriented word processors, and also provides many other convenience functions to reduce user involvement with the technical aspects of markup. We are also seeing general-purpose DTD's being adopted as standards, so that organizations and authors typically do not need to design DTD's unless their needs are quite specialized. Lastly, from the appearance-oriented end of the spectrum, connections to SGML or at least SGML-like facilities are beginning to appear.

Thus the direction that all of this is going is clear, in my opinion, even if the final form of the technology is not so clear because there is quite a ways yet to go.
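
The point about structured retrieval made above can be illustrated with a short Python sketch. Everything in it is assumed for the sake of the example: the "author" tag pair, the function names, and the idea of holding each document as a single coded string. It shows only how a search on a tagged element differs from a search on the whole text.

```python
import re


def field(document, name):
    """Return the text of the first [start-name]...[end-name] element, or None."""
    tag = re.escape(name)
    match = re.search(rf"\[start-{tag}\](.*?)\[end-{tag}\]", document, re.DOTALL)
    return " ".join(match.group(1).split()) if match else None


def by_author(documents, person):
    """Documents in which the person is listed as author."""
    return [d for d in documents if field(d, "author") == person]


def mentioning(documents, person):
    """Documents in which the person's name appears anywhere at all."""
    return [d for d in documents if person in d]
```

With appearance-oriented coding only the second kind of search is possible, since nothing in the file distinguishes an author's name from any other occurrence of the same words.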

What can SGML do for Braille?

As already discussed above, the indirect way that SGML specifies format makes it especially suited for cases where a document is to be produced in multiple formats. It is really just a corollary of this principle that makes SGML beneficial for braille production, for in most cases braille is an alternative publication format for a document already produced in print.

The treatment of ordinary paragraphs provides an example that is both simple and of practical importance. Let us assume that the print copy skips lines between paragraphs, as in our example of the previous section. Even though that is now probably the most common format used in print, it is not customarily used in braille; rather, simple paragraph breaks are usually shown only by an indentation of the first line, without skipping a line. On the other hand, there are cases where a skipped line in print could correspond to a skipped line in braille -- around headings and tables, for example. Consequently, a file coded for the appearance of the print document, as in the first coding presented above, cannot be used just as it is. Nor can it easily be converted by automatic means, because some of the line skips are for paragraph breaks and some are for other purposes. Thus, especially in practical cases that are naturally more extensive and complex than our example, considerable human labor, of a tedious nature, is needed to help sort out the various purposes of the skipped lines and other print formatting devices. By contrast, the SGML file contains exactly the needed information: paragraphs and headings are clearly and separately identified and so the appropriate braille formatting is readily automated.
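
A short Python sketch may make the contrast concrete. The same content-coded file is laid out under two sets of rules, one print-like and one braille-like. The specific numbers are assumptions made for illustration (a 64-character print line, a 40-cell braille line, a two-cell paragraph indent), as are the names parse, layout, PRINT_RULES and BRAILLE_RULES. Note also that the sketch arranges ordinary print characters only; it does not attempt braille translation, which is a separate matter taken up later in this paper.

```python
import re
import textwrap


def parse(source):
    """Yield (element, text) pairs from a content-coded stream."""
    pattern = r"\[start-(heading|paragraph)\](.*?)\[end-\1\]"
    for name, body in re.findall(pattern, source, flags=re.DOTALL):
        yield name, " ".join(body.split())


def layout(source, width, skip_between_paragraphs, indent):
    """Lay out headings and paragraphs according to one set of rules."""
    lines = []
    previous = None
    for name, text in parse(source):
        if name == "heading":
            # Headings are centered and set off by skipped lines in both styles.
            if lines:
                lines.append("")
            for line in textwrap.wrap(text, width):
                lines.append(line.center(width).rstrip())
            lines.append("")
        else:
            if previous == "paragraph" and skip_between_paragraphs:
                lines.append("")
            lines.extend(textwrap.wrap(text, width, initial_indent=" " * indent))
        previous = name
    return "\n".join(lines)


# Assumed rule sets: print skips a line between paragraphs,
# braille indents the first line instead.
PRINT_RULES = dict(width=64, skip_between_paragraphs=True, indent=0)
BRAILLE_RULES = dict(width=40, skip_between_paragraphs=False, indent=2)
```

Calling layout(source, **PRINT_RULES) and layout(source, **BRAILLE_RULES) on the same file gives the two arrangements without any human sorting-out of what each skipped line was for, because the file never mentioned skipped lines in the first place.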

Perhaps this is the place to mention that, from what I have observed, a saving of human labor in braille production generally translates into more and better braille, that is, to more productive and interesting jobs for those who work with braille, not lost jobs!

Coming back to the subject, another way of looking at SGML files is that they are more useful than appearance-coded files because they contain more information. That is, they not only determine (indirectly) WHAT is done in the way of formatting but WHY, in terms of the author's intended structural divisions. That additional information happens to be quite useful for braille production.

Is the application of SGML to Braille a new idea?

The potential benefits of SGML have long been evident to the community involved with automating braille production. Back in the early 1980's, the National Braille Press carried out a project called POINTS, under the sponsorship of the Library of Congress, to investigate among other things the viability of using SGML (then called "generic coding") as a common target for conversion of text coming from various typesetting systems and other disparate sources, and as a common source for text to be published in various alternative formats including braille. By obvious analogy, this idea was called the "hourglass" principle at the time, with SGML serving as the narrow working center of the hourglass. All of us involved with that project believed in SGML on theoretical grounds, and at the end of the project felt that our beliefs had been substantially confirmed.

It must nevertheless be acknowledged that, when it comes down to the production techniques developed in the POINTS project that continued to be used, most of them actually "went around" SGML in most practical cases. That is, the print format coding was, and still is, mostly converted directly to braille format coding without ever going through an SGML stage. Why? Because most real braille production is concerned with just that specific format, and takes place under get-the-job-done constraints on the use of personnel and other resources. Many different kinds of literature, some of them requiring innovative format treatment in braille, must be processed. In other words, the same practical realities that have led to the dominance of appearance-oriented methods in the print world are also operative in the braille world. Under those circumstances, conversion into SGML first seems like unnecessary overhead, an extra step into an indirect language, that is insufficiently rewarded by SGML's benefits because additional production formats are not often contemplated. Even when one other adaptive format, such as large print, is regularly produced, the practical balance has seemingly not yet tipped towards widespread use of SGML.

Despite these sobering realities, it has long been quite plain that SGML would work well for the braille community if only it were more widely used in the print community, that is, if SGML files, coded for a commonly understood DTD, were more generally available as the starting-point for braille work. As discussed earlier, the print world for its own reasons seems to be showing increasing interest in SGML, and so that long-awaited condition may finally be on the horizon. Moreover, we can point to at least three other factors in our favor. First, there are initiatives, such as the recent "Texas Braille Bill", that promote the regular transfer of electronic media from the print publishing industry to alternative-format producers. This has the effect of linking the two kinds of organizations economically, so that the efficiencies of SGML are more operative at the time of original coding. Secondly, the efforts of the International Committee on Accessible Document Design (ICADD) have led to standard DTD's that serve as a well-defined conversion target, in effect an accepted concrete declaration as to "this is what we want", so that the designers of systems for print production can begin to link them to the needs of the braille world. Thirdly, stimulated by those other factors, both commercial efforts and academic projects, such as the one on mathematics braille at Bradford University, now seem more centered on SGML technology.

Will SGML solve all Braille production problems?

No. Many of the problems associated with braille production are more properly associated with the rules for transcribing the text itself, as distinct from arranging the text on the page. As an example, the current rules of English literary braille require that acronyms and abbreviations be treated differently from regular English words that happen to be capitalized. It is beyond the scope of this paper to detail the reasons behind this distinction, though it should be noted that they are well rooted in braille tradition and can be seen as neither arbitrary nor foolish when considered in the entire context of that tradition. That distinction, however, can at times be difficult even for human transcribers. If, for example, the following capitalized headline were to appear in a newspaper in the United States:

  EUROPE JOINS US IN TRADE PACT

it would be impossible to tell whether "US" stood for "United States" or was simply a pronoun, yet the braille rendering would require such a distinction. Automated conversion, perhaps needless to say, runs into difficulties with this distinction even in the much more numerous cases where human judgment is not so challenged.

It is easy to imagine an SGML "solution" to this problem. We could invent a tag pair to make the required distinction; for example we could project that the original author or someone else along the way would enter something like

  EUROPE JOINS [begin-acronym] US [end-acronym] IN TRADE PACT

when "United States" was intended, leaving the unannotated case to imply that the pronoun was intended. It could even happen that such a tag could serve some print purposes, such as in a book where distinctive "small caps" are used for acronyms and regular capitals for other purposes. Realistically, though, the need for such a distinction would not be commonly felt in preparing print, and so could not be relied upon as a solution for braille purposes. This fact becomes even more obvious when we consider other kinds of distinctions required under current braille rules, some of which are even harder to relate to any conceivable print need. For example, the word "tuberose" would be brailled differently when it is a noun (a tuberous plant) as opposed to when it is an adjective (a variant spelling of "tuberous"). For another example, depending on the specific braille code being used and other judgment factors, the expression "(b)" (not counting the quotes) might be brailled differently in each of the following circumstances: (1) when it is used in parallel with "(a)" etc. as an enumerator in a list or outline; (2) when it is a parenthesized reference to the letter b; (3) when it is an expression in a mathematical context; and (4) when it is an excerpt from a computer program.
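
Returning for a moment to the acronym tag pair suggested above: if such tags were present, a program could carry the distinction forward mechanically, as in the following Python sketch. The function name classify_words is invented for this example, and the sketch deliberately stops short of producing any braille; it only marks which words the author has declared to be acronyms, leaving the actual braille rules to whatever transcription software follows.

```python
import re


def classify_words(source):
    """Return (word, is_acronym) pairs from a stream with acronym tags."""
    result = []
    inside = False
    # Tokens are either bracketed tags or runs of non-space characters.
    for token in re.findall(r"\[[^\]]+\]|[^\s\[\]]+", source):
        if token == "[begin-acronym]":
            inside = True
        elif token == "[end-acronym]":
            inside = False
        elif token.startswith("["):
            raise ValueError(f"unexpected tag: {token}")
        else:
            result.append((token, inside))
    return result
```

Without the tags, every word in the headline would come back marked as an ordinary word, which is exactly the unannotated case described above; the program has no way to supply the missing judgment itself.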

How can UBC help?

These considerations, and others, have given rise to the "Unified Braille Code" (UBC), which is not yet an official code but a research project of the Braille Authority of North America (BANA). UBC aims at defining the relationship between print symbols and braille symbols in a way that minimizes ambiguity and judgment problems, in BOTH directions of conversion, and further encompasses the needs of technical literature, all while preserving all the essential characteristics of current English literary braille.

It would take us too far afield in this paper to elaborate further on UBC; that of course has been done elsewhere as to motivation, and is still under development as to methodology. Suffice it to say that SGML and UBC are both positive developments for braille; that neither is sufficient by itself, in that each solves problems that the other does not; and consequently that they are entirely complementary, which is the main point of this paper.

As a concluding footnote, in case it may appear that SGML and UBC taken together bid fair to solve all braille problems, it may be worth mentioning that they do not. When we consider all the ramifications of foreign languages (even in English context), music notation, and graphics, we find ourselves at the threshold of the problems that we will be pondering for many years to come. Human experience and judgment remain a valued part of the braille transcription process, all the more so because an increased volume of automated transcription can only be accompanied by an increased incidence of the "hard" problems that only people can solve. SGML and UBC will, we expect, free those people to concentrate on those kinds of problems.

Further information on SGML and ICADD

The SGML Handbook, by Charles F. Goldfarb. (Clarendon Press, 1990.) This is the "bible" on the subject, a "practical aid for people who want to understand, use and implement ISO 8879 -- the SGML standard". It does contain introductory concept papers and some application examples (sample DTD's). It also lists additional sources, among them: The International SGML Users' Group (Secretary: Stephen G. Downie, c/o SoftQuad Inc., Toronto, Ontario, Canada); Graphics Communication Association (GCA) (Arlington, Virginia, USA); An introductory video by Yuri Rubinsky and Marc Giacomelli (available from GCA).
Electronic Manuscript Preparation and Markup, by National Information Standards Organization (Bethesda, Maryland, USA). This is a technical document, being the text of standard Z39.59-1988, a DTD generally known as the AAP (Association of American Publishers) DTD.
Reference Manual on Electronic Manuscript Preparation and Markup, by the Association of American Publishers (available from the Electronic Publishing Special Interest Group [EPSIG], Dublin, Ohio, USA; tel. 614-764-6000). This explains how to use the above AAP DTD standard, for publishers, authors, and editors, in a format that is simpler to follow.
Author's Guide to Electronic Manuscript Preparation and Markup, by the Association of American Publishers (also available from EPSIG). This document has the same stated purpose as the foregoing, but is less detailed; it describes only the most basic rules and tags.
International Committee for Accessible Document Design (ICADD) Statement of Purpose. (available from Recording for the Blind R & D, Missoula, Montana, USA; tel. 406-728-7201)

Note that, since the above was written, the Unified Braille Code has become a research project of the International Council on English Braille (ICEB).