graphein: Text Processors

1. Text Processors

1.1. Syntax Checking: xmllint

1.1.1. Catalog Files for xmllint

To run xmllint, add lines based on the following:

 
XML_CATALOG_FILES="[path-to-document-hierarchy-root]/xml-catalog file:///etc/xml/catalog"
export XML_CATALOG_FILES

to your $HOME/.bashrc file. This defines two XML Catalog Files for xmllint to use: a local file and the systemwide file. (The systemwide location was correct for SuSE 10.0; it may differ on other systems.) The value is quoted because the two entries are separated by a space. The [path-to-document-hierarchy-root] in the local file specification is the fully qualified pathname of the root of the current graphein document hierarchy checkout. For example, if you've got a big disk with a top-level directory named "/data0/", and you've checked out a copy of a graphein hierarchy to "/data0/circuitousroot", then "/data0/circuitousroot" is the document hierarchy root.

You'll have to source .bashrc, of course.
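
As a quick sanity check, something like the following can be pasted into a shell. It is only a sketch: GRAPHEIN_ROOT is a made-up helper variable, and /data0/circuitousroot is simply the example checkout location used above. It sets the variable and then warns about any listed catalog that isn't actually present.

```shell
# GRAPHEIN_ROOT is a hypothetical helper variable; substitute your own checkout.
GRAPHEIN_ROOT=${GRAPHEIN_ROOT:-/data0/circuitousroot}

# Two catalogs, separated by a space: the local one and the systemwide one.
XML_CATALOG_FILES="$GRAPHEIN_ROOT/xml-catalog file:///etc/xml/catalog"
export XML_CATALOG_FILES

# Warn about any catalog entry that does not exist on this machine.
for entry in $XML_CATALOG_FILES; do
    path=${entry#file://}          # strip the URL scheme, if present
    [ -f "$path" ] || echo "warning: no catalog file at $path" >&2
done
```

A warning for the systemwide entry suggests that your distribution keeps its catalog somewhere other than /etc/xml/catalog.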

This document hierarchy root should in fact contain an XML Catalog File, called xml-catalog, and this XML Catalog File should contain:

<<xml-catalog>>= 
<?xml version="1.0"?> 
<!-- 
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" 
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd"> 
--> 
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"> 
<group prefer="public" xml:base="file:///usr/local/xml/tei/"> 
<public publicId="-//TEI P5//DTD Main Document Type//EN" 
uri="tei-p5-schema/share/xml/tei/schema/dtd/tei.dtd"/> 
</group> 
</catalog> 

Basically, all this does is tell those XML-aware programs which use XML Catalog Files where the TEI DTD is. The TEI DTD should, of course, be where you say it is.
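
One way to check that the catalog really does point somewhere is to ask xmlcatalog (the catalog tool that ships with libxml2, alongside xmllint) to resolve the TEI public identifier. This is a sketch under assumptions: it assumes xmlcatalog is installed and that you run it from the document hierarchy root; the CATALOG variable is introduced here purely for illustration.

```shell
# The public identifier declared in the xml-catalog chunk above.
TEI_PUBLIC_ID='-//TEI P5//DTD Main Document Type//EN'

# CATALOG is a hypothetical variable; it defaults to the file in the current directory.
CATALOG=${CATALOG:-xml-catalog}

if command -v xmlcatalog >/dev/null 2>&1 && [ -f "$CATALOG" ]; then
    # Prints what the catalog resolves the identifier to, or reports that there is no entry.
    xmlcatalog "$CATALOG" "$TEI_PUBLIC_ID"
else
    echo "xmlcatalog or $CATALOG not available; skipping lookup" >&2
fi
```

If the lookup succeeds, it is still worth confirming that a tei.dtd file exists at the location it reports.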

1.1.2. xmllint and XInclude

TEI P5 specifies "validity is verified after the resolution of all the <xi:include> elements." (§14.9.4, "Well-formedness and Validity of Stand-Off Markup"). So validating a document requires doing all XInclude processing first.

One way to validate would be to use Xerces, driven from Saxon, as in the XSL processing. It can be instructed to validate. Unfortunately, it seems to validate before doing the XInclude, not after. Behaving in this way, it (correctly from its point of view) decides that a TEI document containing XInclude is not, as such, valid.

So back to xmllint. It does handle XInclude, and can be instructed to "postvalidate." Unfortunately, when it does the XInclude, it adds an xmlns attribute for the namespace of the target entity (the TEI namespace, here) to the XIncluded element. I don't think that it needs to do this, but it does. This having been done, the postvalidation fails because "xmlns" is not generally a valid attribute in the TEI.

There is a "--nsclean" option to xmllint which will remove such extra xmlns declarations. Unfortunately, it only does so if the offending declaration has the same namespace prefix as the default namespace. Actually, I think that here it does ("tei:"), but xmllint apparently doesn't think so. You can add "tei:" namespace prefixes to the overall <TEI> tags, and this fixes xmllint. It breaks Xerces/Saxon, however. Sigh.

So, an ugly hack: use an Awk script to remove the extra xmlns attributes manually. Here's the script:

<<xmllint-hack.awk>>= 
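# Drop the TEI default-namespace declaration that xmllint adds to XIncluded elements.
{
    for (i = 1; i <= NF; i++) {
        if ($i == "xmlns=\"http://www.tei-c.org/ns/1.0\">") {
            # The declaration was the last attribute: keep only the closing bracket.
            printf ">"
        } else if ($i != "xmlns=\"http://www.tei-c.org/ns/1.0\"") {
            # Every other field passes through, followed by a single space.
            printf "%s ", $i
        }
    }
    printf "\n"
}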
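
To see what the script does to a line, here is a small sketch that wraps the same Awk program in a shell function and feeds it two made-up start tags. The function name strip_tei_xmlns and the sample tags are illustrative only; the Awk body is the one from the chunk above.

```shell
# Hypothetical wrapper around the Awk program from xmllint-hack.awk.
strip_tei_xmlns() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "xmlns=\"http://www.tei-c.org/ns/1.0\">") {
                printf ">"
            } else if ($i != "xmlns=\"http://www.tei-c.org/ns/1.0\"") {
                printf "%s ", $i
            }
        }
        printf "\n"
    }'
}

# Declaration as the last attribute: becomes  <div >
printf '%s\n' '<div xmlns="http://www.tei-c.org/ns/1.0">' | strip_tei_xmlns

# Declaration in the middle: becomes  <div type="chapter">  (plus a trailing space)
printf '%s\n' '<div xmlns="http://www.tei-c.org/ns/1.0" type="chapter">' | strip_tei_xmlns
```

Note that the script works on whitespace-separated fields, so it also collapses runs of whitespace, drops leading indentation, and leaves a trailing space on most lines. It only catches the declaration when it stands as a field of its own, optionally with the closing ">" attached.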

Here's how it is used in the makefile:

<<make-lint>>= 
# make stops at the first failing command, so the .lint stamp is
# only touched when validation succeeds.
$(LINTS): %.lint : %.tei
	rm -f temp1-linthack temp2-linthack
	xmllint --nonet --xinclude $*.tei > temp1-linthack
	awk -f xmllint-hack.awk temp1-linthack > temp2-linthack
	xmllint --nonet --noout --valid temp2-linthack
	rm -f temp1-linthack temp2-linthack
	touch $*.lint
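
For trying the same three steps by hand, outside of make, a shell function along the following lines should do. It is a sketch rather than part of the build: the names lint_tei and AWK_HACK are invented here, and it assumes that xmllint-hack.awk sits in the current directory unless AWK_HACK says otherwise.

```shell
# AWK_HACK is a hypothetical variable naming the Awk script shown earlier.
AWK_HACK=${AWK_HACK:-./xmllint-hack.awk}

# lint_tei FILE: expand XIncludes, strip the extra xmlns attributes, then validate.
lint_tei() {
    src=$1
    tmpdir=$(mktemp -d) || return 1
    if xmllint --nonet --xinclude "$src" > "$tmpdir/expanded" &&
       awk -f "$AWK_HACK" "$tmpdir/expanded" > "$tmpdir/cleaned" &&
       xmllint --nonet --noout --valid "$tmpdir/cleaned"; then
        status=0
    else
        status=1
    fi
    rm -rf "$tmpdir"        # clean up whether or not validation succeeded
    return "$status"
}
```

A call such as `lint_tei chapter.tei && touch chapter.lint` (with chapter.tei standing in for a real document) then mirrors the make rule. Unlike the rule, it uses a private scratch directory, so two runs in the same directory do not trample each other's temporary files.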

1.2. XSLT TEI to HTML: Saxon, xml-commons-resolver

1.2.1. XML Commons Resolver

As configured here, the Saxon-B XSL processor will invoke the Xerces XML parser to do the parsing it requires. This in turn requires access to the DTD (for Xerces) and the XSL Stylesheet (for Saxon). The actual DTDs are linked up through the use of XML Catalog Files, as described above for xmllint. I'm not quite sure how the Catalog File really addresses the stylesheet location, since it's in the current directory (this may be the source of the problem I describe below).

The Catalog File is the same. What differs (vis-à-vis xmllint) is the way the programs find out about it. These are Java programs, so nothing is going to be simple. The XML Catalog File is itself referenced in a file called /etc/java/resolver/CatalogManager.properties. This file will have, inter alia, a line containing:

 
catalogs=file://[path-to-document-hierarchy-root]/xml-catalog 

The Java "classpath" for the incantation to get all of this to run must contain both the directory in which this file resides (/etc/java/resolver/) and the fully qualified pathname of the XML Commons Resolver jar (mine is in /usr/share/java/xml-commons-resolver-1.1.jar; yours may be somewhere else).
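
To make the two pieces concrete, the following sketch writes a minimal CatalogManager.properties into a scratch directory and composes the matching classpath fragment. Everything specific in it is an assumption for illustration: DOC_ROOT reuses the /data0/circuitousroot example, the scratch directory stands in for /etc/java/resolver, and the prefer and verbosity lines are optional resolver settings that the text above does not require.

```shell
# Hypothetical values; adjust to your own checkout and jar location.
DOC_ROOT=${DOC_ROOT:-/data0/circuitousroot}
RESOLVER_JAR=${RESOLVER_JAR:-/usr/share/java/xml-commons-resolver-1.1.jar}
RESOLVER_DIR=$(mktemp -d)      # stands in for /etc/java/resolver

# The catalogs line is the one that matters; the other two are optional settings.
cat > "$RESOLVER_DIR/CatalogManager.properties" <<EOF
catalogs=file://$DOC_ROOT/xml-catalog
prefer=public
verbosity=1
EOF

# The classpath needs the jar itself and the directory holding the properties file.
RESOLVER_CLASSPATH="$RESOLVER_JAR:$RESOLVER_DIR"
echo "$RESOLVER_CLASSPATH"
```

The directory, not the properties file, goes on the classpath, because the resolver looks the file up by name as a classpath resource.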

Xerces 2.9.0 comes prepackaged with the XML Commons Resolver 1.2, but I can't seem to get it to work (it seems to fail to find the stylesheet in the current directory, even though this doesn't seem to be something that has to be specified). So I still use the XML Commons Resolver 1.1, which I picked up from the Internet.

See below (Xerces/Saxon) for the full incantation to cause them to use the XML Commons Resolver (the "-x", "-y", and "-r" parameters).

1.2.2. Xerces 2.9.0 and Saxon-B 8.8

I updated to Xerces 2.9 and Saxon-B 8.8 in the process of debugging the use of XIncludes. These versions seem to work for this; earlier versions may not.

To install Xerces 2.9.0 and Saxon-B 8.8, obtain them, and unzip them somewhere. I put them in:

 
/usr/share/java/Xerces-j-2.9.0/ 
/usr/share/java/saxon-8.8/ 

<<make-htmls>>= 
$(HTMLS): %.html : %.tei $(STATIC_FILES) $(LINKING_IMAGES) dependencies.dep images-scaled.dep
	(for resolution in 0; do \
	  java \
	    -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl \
	    -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl \
	    -Dorg.apache.xerces.xni.parser.XMLParserConfiguration=org.apache.xerces.parsers.XIncludeParserConfiguration \
	    -classpath "/usr/share/java/Xerces-j-2.9.0/xerces-2_9_0/xml-apis.jar:/usr/share/java/Xerces-j-2.9.0/xerces-2_9_0/xercesImpl.jar:/usr/share/java/xml-commons-resolver-1.1.jar:/usr/share/java/saxon-8.8/saxon8.jar:/etc/java/resolver" \
	    net.sf.saxon.Transform \
	    -x org.apache.xml.resolver.tools.ResolvingXMLReader \
	    -y org.apache.xml.resolver.tools.ResolvingXMLReader \
	    -r org.apache.xml.resolver.tools.CatalogResolver \
	    -u \
	    $*.tei graphein-tohtml.xsl \
	    own-basename=$* \
	    css-basename=main \
	    scale-factor=$$resolution \
	    location=`./find-location.sh $*` \
	    depth-in-hierarchy=`./current-depth-in-hierarchy.sh` \
	    depth-in-category=`./current-depth-in-category.sh` \
	    public-or-private=$(P) > $*.html-tmp; \
	  awk -f ./insert-img-dimensions.awk $*.html-tmp > $*-$$resolution.html; \
	  rm -f $*.html-tmp; \
	  if [ $$resolution -eq 0 ]; then \
	    rm -f $*.html; \
	    ln -s $*-0.html $*.html; \
	  fi; \
	done )

1.3. XSLT TEI to XSL-FO to PDF: Saxon, FOP
