Lessons learned:
Think four times before doing stream-based XML processing, even though it appears to be more efficient than tree-based. Stream-based processing is usually more difficult.
But if you have to do stream-based processing, make sure to use robust, fairly scalable tools like XML::Templates, not sgmlspl. Of course it cannot be as pleasant as tree-based XML processing, but examine db2x_manxml and db2x_texixml.
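The trade-off can be sketched in Python (used here instead of Perl purely for illustration). The tree-based version simply navigates a complete document, while the stream-based version must react to events as they arrive and maintain any context it needs itself:

```python
import io
import xml.etree.ElementTree as ET

doc = b"<book><title>A</title><chapter><title>B</title></chapter></book>"

# Tree-based: load everything into memory, then navigate -- simple to write.
tree = ET.parse(io.BytesIO(doc))
titles_tree = [el.text for el in tree.iter("title")]

# Stream-based: handle parse events one at a time -- uses less memory on
# huge documents, but any logic that needs context must track state itself.
titles_stream = []
for event, el in ET.iterparse(io.BytesIO(doc), events=("end",)):
    if el.tag == "title":
        titles_stream.append(el.text)

assert titles_tree == titles_stream == ["A", "B"]
```

Here the two versions are equally short only because the task is trivial; once the logic depends on where an element sits in the document, the stream-based version grows state-tracking code that the tree-based version gets for free.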
Do not use XML::DOM directly for stylesheets. Your “stylesheet” would become seriously unmanageable. It’s also extremely slow for anything but trivial documents.
At least take a look at some of the XPath modules out there. Better yet, see if your solution really cannot use XSLT. A C/C++-based implementation of XSLT can be fast enough for many tasks.
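For a taste of what XPath buys over hand-rolled DOM traversal, here is a sketch using Python’s standard library (which supports only a limited XPath subset; full XPath 1.0 needs a third-party module such as lxml):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<article><sect1><title>Intro</title></sect1>"
    "<sect1><title>Usage</title></sect1></article>"
)

# One declarative path expression replaces a nest of explicit loops
# over child nodes.
titles = [t.text for t in doc.findall("./sect1/title")]
assert titles == ["Intro", "Usage"]
```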
Avoid XSLT extensions whenever possible. I don't think there is anything wrong with them intrinsically, but it is a headache to have to compile your own XSLT processor. (libxslt is written in C, and the extensions must be compiled-in and cannot be loaded dynamically at runtime.) Not to mention there seems to be a thousand different set-ups for different XSLT processors.
Perl is not as good at XML as it’s hyped to be.
SAX comes from the Java world, and its port to Perl (with all the object-orientedness, and without adopting Perl idioms) is awkward to use.
Another problem is that Perl SAX does not seem to be well-maintained. The implementations have various bugs; while they can be worked around, these bugs have been around for so long that it does not inspire confidence that the Perl XML modules are reliable software.
It also seems that no one else has seriously used Perl SAX for robust applications. It is unnecessarily hard to do certain tasks, such as displaying error diagnostics for the input, or processing large documents with complicated structure.
Do not be afraid to use XML intermediate formats (e.g. Man-XML and Texi-XML), processed by a scripting language, for converting to other markup languages. The syntax rules of the target formats are designed for authoring by hand, not for machine generation; hence a conversion using tools designed for XML-to-XML conversion requires jumping through hoops.
You might think that we could, instead, make a separate module that abstracts all this complexity from the rest of the conversion program. For example, there is nothing stopping an XSLT processor from serializing the output document as a text document obeying the syntax rules for man pages or Texinfo documents.
Theoretically you would get the same result, but it is much harder to implement. It is far easier to write plain text manipulation code in a scripting language than in Java or C or XSLT. Also, if the intermediate format is hidden in a Java class or C API, output errors are harder to see. Whereas with the intermediate-format approach, we can visually examine the textual output of the XSLT processor and fix the Perl script as we go along.
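A minimal sketch of the intermediate-format approach, in Python rather than Perl, with invented element names (this is NOT the actual Man-XML document type). The second stage is plain text manipulation, and its input and output can both be inspected by eye:

```python
import xml.etree.ElementTree as ET

# A tiny intermediate document, of the kind an XSLT stage might emit.
# Element and attribute names are hypothetical, for illustration only.
intermediate = """\
<manpage title="FOO" section="1">
  <para>foo does nothing.</para>
  <para>It does it well.</para>
</manpage>"""

root = ET.fromstring(intermediate)

# Serialize to man-page (roff) syntax: a job of simple text manipulation.
lines = ['.TH "%s" "%s"' % (root.get("title"), root.get("section"))]
for para in root.iter("para"):
    lines.append(".PP")
    lines.append(para.text.strip())

print("\n".join(lines))
```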
Some XSLT processors support scripting to go beyond XSLT functionality, but these facilities are usually not portable, and not always easy to use. Therefore, opt for two-pass processing, with XSLT as the first stage and a standalone script as the second.
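The resulting pipeline looks something like this (the stylesheet and file names here are hypothetical):

```
# Stage 1: XSLT transforms DocBook into the intermediate XML format.
xsltproc docbook-to-manxml.xsl manual.xml > manual.manxml

# Stage 2: a standalone script serializes the intermediate format
# to the target markup language.
db2x_manxml manual.manxml
```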
Finally, another advantage of using intermediate XML formats processed by a Perl script is that we can often eliminate the use of XSLT extensions. In particular, all the way back when XSLT stylesheets first went into docbook2X, the extensions related to Texinfo node handling could have been easily moved to the Perl script, but I didn't realize it! I feel stupid now.
If I had known this in the very beginning, it would have saved a lot of development time, and docbook2X would be much more advanced by now.
Note that even the man-pages stylesheet from the DocBook XSL distribution essentially does two-pass processing just the same as the docbook2X solution. That stylesheet had formerly used one-pass processing, and its authors probably finally realized what a mess that was.
Design the XML intermediate format to be easy to use from the standpoint of the conversion tool, and similar to how XML document types work in general. e.g. abstract the paragraphs of a document, rather than their paragraph breaks (the latter is typical of traditional markup languages, but not of XML).
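For example (element names invented for illustration):

```
Traditional markup (roff) marks the breaks between paragraphs:

    First paragraph.
    .PP
    Second paragraph.

An XML intermediate format should mark the paragraphs themselves:

    <para>First paragraph.</para>
    <para>Second paragraph.</para>
```

The second form is far easier for an XML-to-XML conversion to produce, since stylesheets naturally process elements, not the gaps between them.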
I am quite impressed by some of the things that people make XSLT 1.0 do. Things that I thought were impossible, or at least unworkable, without using a “real” scripting language. (db2x_manxml and db2x_texixml fall in the category of things that can be done in XSLT 1.0, but only inelegantly.)
Internationalize as soon as possible. That is much easier than adding it in later.
Same advice for build system.
I would suggest against using build systems based on Makefiles or any form of automake. Of course it is inertia that prevents people from switching to better build systems. But also consider that while Makefile-based build systems can do many of the things newer build systems are capable of, they often require too many fragile hacks. Developing these hacks takes time that would be better spent developing the program itself.
Alas, better build systems such as scons were not available when docbook2X was at an earlier stage. It’s too late to switch now.
Writing good documentation takes skill. This manual has been revised substantially at least four times [5], with the author consciously trying to condense information each time.