XML File Versioning Options

Background

It takes about 6000 lines of XML to describe the flight instrument for detModel. For clarity and (relative) ease of maintenance, this is currently divided into about 20 physical files, not including the DTD, which is in yet another physical file. Some of the XML files correspond to logical divisions, such as the materials description or the definitions of integer constants used in the description of CAL; others don't. The include mechanism used, external entities, places few requirements on the form of what is included: an included file need not be a single element, although its markup must be balanced. Its content is in effect substituted into the including file as that file is parsed, much like a C++ #include. How can we keep track of exactly which files, and which versions of those files, were used? There are at least three approaches.

Unofficial Parse

The files used can be discovered by starting from the top file (e.g., flight.xml) and doing some simple string processing to find referenced external entities and their definitions. The versions of the referenced files can either be determined by starting with the package release tag or can be read from the files themselves, each of which should have an embedded CVS $Header$ macro.
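
As a rough illustration of the string processing involved, here is a minimal Python sketch. It is not part of any existing package, and the function names are invented for this note. It assumes entities are declared with a quoted SYSTEM identifier and that the $Header$ macro has been expanded by CVS into its usual "path,v revision ..." form. It would also be fooled by a declaration that sits inside a comment.

```python
import re

# Matches external entity declarations of the form
#   <!ENTITY someName SYSTEM "someFile.xml">
# (parameter entities, written with a leading %, are accepted too).
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+(?:%\s+)?(?P<name>[^\s%]+)\s+SYSTEM\s+'
    r'(?P<q>["\'])(?P<path>.*?)(?P=q)'
)

# Matches an expanded CVS keyword: $Header: <rcs path>,v <revision> ... $
CVS_HEADER = re.compile(r'\$Header:\s*(?P<rcsfile>\S+),v\s+(?P<rev>[\d.]+)[^$]*\$')


def find_external_entities(xml_text):
    """Return {entity name: system identifier} for SYSTEM entities in xml_text."""
    return {m.group("name"): m.group("path")
            for m in ENTITY_DECL.finditer(xml_text)}


def cvs_revision(xml_text):
    """Return the revision in the first expanded $Header$ macro, or None."""
    m = CVS_HEADER.search(xml_text)
    return m.group("rev") if m else None
```

A driver would read the top file, call find_external_entities on its text, and then repeat the process on each referenced file, since included files may declare entities of their own.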

Entity Resolver

The Xerces parser allows the application to take part in finding and including a file by means of an Entity Resolver, a user routine that is called back whenever a reference to an external entity is encountered in the XML file being parsed. The callback could keep track of all filenames. As in the first approach, the CVS version could be determined either from the release tag or by scanning the file itself for a $Header$ macro.
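
The callback idea can be sketched with Python's standard xml.sax module, which offers an EntityResolver interface in the same SAX style. This is only a stand-in to show the shape of the approach, not Xerces code; the class and function names are invented for this note. Recent Python versions do not process external general entities unless the corresponding feature is switched on, hence the setFeature call.

```python
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges


class RecordingResolver(EntityResolver):
    """Remembers the system identifier of each external entity requested."""

    def __init__(self):
        self.seen = []

    def resolveEntity(self, publicId, systemId):
        self.seen.append(systemId)
        # Returning the system id unchanged lets the parser open the file itself.
        return systemId


def files_used(top_file):
    """Parse top_file and return the system ids of the entities it pulled in."""
    resolver = RecordingResolver()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(resolver)
    parser.setContentHandler(ContentHandler())  # document content is ignored here
    parser.parse(top_file)
    return resolver.seen
```

The list is built during the real parse, so it reflects the files the parser actually opened. Version information is still missing at this point and has to come from somewhere else.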

CVS $Header$ as XML Attribute Value

All of the CVS $Header$ macros in current XML files are embedded in comment lines, but they can also appear as attribute values, which the application would naturally encounter during the parse. For those physical files which correspond to a single XML element and its children, the attribute can be associated with that top element. Some of the included files, however, don't have this simple structure. A new element would have to be added to our DTD to carry the attribute; having added it, we might prefer to use it in all included files. Then some part of the infrastructure, such as the GDDDocManager class, could read all the values of the $Header$ macros and make them available to the rest of the application.
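
To make this concrete, the Python sketch below shows what such a versioning element might look like and how its values could be gathered after the parse. The element name fileVersion, the attribute name cvsHeader, the enclosing element names, and the shortened header strings are all placeholders invented for this note; nothing like them exists in the current DTD. The real collection code would presumably be C++ working on the Xerces DOM.

```python
import xml.etree.ElementTree as ET

# Hypothetical names: neither exists in the current DTD.
VERSION_ELEMENT = "fileVersion"
HEADER_ATTRIBUTE = "cvsHeader"

# Placeholder document; each included file contributes one fileVersion element.
example = """<top>
  <fileVersion cvsHeader="$Header: /hypothetical/top.xml,v 1.1 $"/>
  <section>
    <fileVersion cvsHeader="$Header: /hypothetical/section.xml,v 1.4 $"/>
  </section>
</top>"""


def collect_headers(root):
    """Return the cvsHeader value of every fileVersion element, in document order."""
    return [
        elem.get(HEADER_ATTRIBUTE)
        for elem in root.iter(VERSION_ELEMENT)
        if elem.get(HEADER_ATTRIBUTE) is not None
    ]


for header in collect_headers(ET.fromstring(example)):
    print(header)
```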

Pros and Cons

Practically everything I've described depends on embedding a CVS macro in each file. It would be difficult to detect a missing macro while doing the XML parse. Verification that each file really does have a macro is probably best done by a standalone program or script during the build of the package.
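
A build-time check of this kind could be quite small. The following Python sketch is one possible form, under the assumption that a file counts as versioned only if CVS has expanded the macro (a bare $Header$ that was never committed does not count). The function names are invented for this note.

```python
import re
import sys

# An expanded keyword looks like "$Header: <path>,v <revision> ... $".
EXPANDED_HEADER = re.compile(r"\$Header:[^$\n]+\$")


def files_missing_header(paths):
    """Return the paths whose contents contain no expanded $Header$ macro."""
    missing = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            if not EXPANDED_HEADER.search(f.read()):
                missing.append(path)
    return missing


def main(argv):
    """Report offending files on stderr; return 1 if there are any, else 0."""
    bad = files_missing_header(argv)
    for path in bad:
        print(f"no expanded CVS $Header$ macro: {path}", file=sys.stderr)
    return 1 if bad else 0
```

Run as a script with sys.exit(main(sys.argv[1:])) over the list of XML files, a nonzero exit status could be used to fail the build.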

Any of these schemes could probably be made to work reliably in the controlled environment of production running. Since XML files are just editable ASCII, however, nothing is going to be foolproof in a developer's private space.

The Unofficial Parse requires little or no change to existing XML files and could be used for any XML application which uses external entities to include files. A minor disadvantage is that the unofficial parse and the official parse take place at different times, so there is a small chance that they don't see the same set of files. More serious is that the unofficial parse duplicates some of the work already being done by Xerces, and doing it right might not be as simple as it seems.

The Entity Resolver method has several advantages:

The one disadvantage is that the Entity Resolver only gets the file name and path for free; more work has to be done to discover the CVS versions of the files.

The XML Attribute method could do the complete job of getting full versioning information to the application with rather straightforward C++ code and no scripting at all. However, the structure of the XML files would change somewhat to accommodate the new versioning element, probably necessitating at least minor changes (to ignore the new element) in detModel and perhaps xmlUtil. For the scheme to work properly, existing included files would have to be modified, and writers of new included files would have to remember to include the new versioning element. There is no straightforward way for a client of the parser to verify that each physical file has such an element. Any such check would have to involve an entity resolver, or would have to be done "offline," e.g., by a script as part of the package build. (However, it would be even more difficult to find a $Header$ macro embedded in an XML comment.)

