Tuesday, December 15, 2009

Death by Complexity

This post will be incomprehensible.

Over the last three days I have spent 4.5 hours working on GDAL Ticket 3276, which relates to a failure to use external overviews with JPEG2000-compressed NITF files - in particular for NITF files containing more than one JPEG2000 image.

This bug was particularly hairy because it comes at the intersection of several things that are messy/complex in GDAL:

  1. JPEG2000 in NITF is implemented by creating a JP2KAK driver dataset wrapping the JPEG2000 image data within the NITF file, and then using its bands indirectly as bands for the NITF dataset. These band objects still mostly think of the JPEG2000 dataset as "their" dataset, but for some purposes we really might wish they knew about the NITF dataset that appears to own them.
  2. NITF files can contain more than one image. Such multi-image files are treated as containing subdatasets, one per image. For the most part these subdatasets are intended to act as freestanding things, but they are also, to some extent, related back to the single file on disk containing the subdatasets.
  3. Overviews in GDAL are mostly handled through an overview manager object embedded in the GDALDataset base class. However, JPEG2000 images have built-in overviews not handled through the overview manager.
  4. The PAM (Persistent Auxiliary Metadata) mechanism is used via an intermediate GDALPamDataset class to provide a way of storing additional information about datasets that the intrinsic format does not support. This information is stored in an .aux.xml file associated with the main data file.
  5. In GDAL 1.7 a new capability was added to store PAM information for subdatasets in an .aux.xml file associated with the main data file so that subdatasets would work as much like a regular dataset as possible.
  6. In GDAL 1.7 support was added for building overviews on subdatasets. Since overviews would normally be stored in a .ovr file with the same basename as the main filename, it was necessary to do something special so that overviews of subdatasets would have one .ovr file per subdataset. This was accomplished by keeping the overview file name in the .aux.xml file associated with the subdataset.

As an added bonus it was decided that it was desirable to support building generic external TIFF overviews for JPEG2000 images and to use those in place of the JPEG2000-derived overviews, which can be fairly slow to access.
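
To make item 6 a little more concrete, here is a minimal Python sketch of the idea of recording a per-subdataset overview file name in the sidecar file. It is not GDAL code: the XML layout, the element names, the file names and the helper overview_file_for are all made up for illustration, and the real .aux.xml structure may well differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sidecar content. The element and attribute
# names are illustrative only, not a statement of GDAL's actual schema.
AUX_XML = """
<PAMDataset>
  <Subdataset name="NITF_IM:0:multi.ntf">
    <OverviewFile>multi.ntf_0.ovr</OverviewFile>
  </Subdataset>
  <Subdataset name="NITF_IM:1:multi.ntf">
    <OverviewFile>multi.ntf_1.ovr</OverviewFile>
  </Subdataset>
</PAMDataset>
"""


def overview_file_for(aux_xml, subdataset_name, default_ovr):
    """Return the overview file recorded for a subdataset.

    Falls back to default_ovr (the usual "same basename plus .ovr"
    convention) when the sidecar has no entry for that subdataset.
    """
    root = ET.fromstring(aux_xml.strip())
    for node in root.findall("Subdataset"):
        if node.get("name") == subdataset_name:
            recorded = node.findtext("OverviewFile")
            if recorded:
                return recorded
    return default_ovr
```

The point is only that each subdataset gets its own .ovr file, and that the mapping lives in the one .aux.xml next to the main file rather than being derived from the file name.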

It turns out that .aux.xml metadata was supported for NITF subdatasets, and it was possible to build overviews for NITF subdatasets, and it was possible to substitute external TIFF overviews for JPEG2000 data streams in an NITF file. But it was not possible to substitute external TIFF overviews for JPEG2000 data streams in an NITF file with multiple images (subdatasets).
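
The behavior being asked for can be stated as a simple preference rule: use external overviews when they exist, otherwise fall back to the ones built into the JPEG2000 stream. A toy Python sketch of that rule follows; pick_overviews and its arguments are hypothetical names, and this says nothing about how GDAL actually implements the choice.

```python
def pick_overviews(external, builtin):
    """Return (source, levels), preferring external overviews when present.

    external and builtin are lists of overview decimation factors,
    e.g. [2, 4, 8]. An empty external list means no .ovr file was found.
    """
    if external:
        return ("external", list(external))
    return ("builtin", list(builtin))
```

The failing case in the ticket was the one where external overviews existed for a subdataset of a multi-image file but were not being picked up.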

It took me a long time to figure out what aspects were broken, and how the various components were supposed to work, even though I had implemented most of them. The problem is that many of these capabilities are rarely used, don't fit the standard GDAL model well, and were individually quite complex to implement. The complexity compounds as the various aspects come together.

Each of the capabilities was added for fairly good reasons - mostly in order to provide a seamless and performant experience for GDAL users. But in order to provide this consistent external set of behaviors we are having to build more and more complexity into parts of GDAL - to the point where I am not sure it is sustainable.

Interestingly, most of this complexity has grown without input from the broader GDAL community. The PAM mechanism predated the modern Project Steering Committee and its RFC process. I added the changes for PAM on subdatasets, and some of the specific NITF driver capabilities, without discussion with the PSC, on the assumption that they were either bug fixes or sufficiently driver-local that the PSC did not need to be involved. Possibly, if these changes had needed to be justified in public, pushback on the complexity might have prevented some of them.

The specific problem itself was fixed, as is documented in the ticket, with changeset 18312 holding the core fix. However, this fix is (IMHO) just adding additional fragility to the existing house of cards.

I don't really have a solution to the growing complexity, but perhaps thinking about the issues, and starting to open them up a bit, is a first step to containing the danger. There is certainly a cautionary tale or two in here.

BTW, GDAL 1.7.0Beta 1 is now released - testing and bug reports are welcome!