Sep 22, 2010
Corpora parsed by Alpino consist of a collection of dependency trees stored as XML data. Since corpora are usually large and XML data redundant, it is often attractive to archive and compress corpora.
For archiving a KISS approach is taken: all XML files are concatenated to one file and an index file is added, containing filenames and their positions/sizes in the data file.
Compression of the data file is less obvious: common compression formats such as gzip or bzip2 do not suffice, since they do not allow for random access. So, if we want to retrieve a random block of uncompressed data, all preceding data needs to be decompressed. Fortunately, an alternative exists: a variation on gzip, named dictzip sacrifices a bit of compression efficiency for semi-random access to uncompressed data.
Ignoring implementation details, the principle of dictzip files is very simple: data is compressed in chunks of a fixed size. The compressed data then consists of compressed chunks, where we know the size of the uncompressed chunk. To be able to find the compressed chunks, a list of chunk sizes is added to the header. This list allows us to find the location of each chunk in the compressed data.
Finding the correct starting chunk for a position in the uncompressed data is simple:
-- Get chunk and position within the chunk. chunk pos chunkSize = pos / chunkSize posInChunk pos chunkSize = pos % chunkSize -- Get the position of the chunk in the compressed data posInZData chunks chunk = chunks ! chunk
This extension can be implemented in a gzip-compatible manner. The gzip file format specification allows for an optional information fields located after the header, which is convenient for storing the chunk sizes. As an end-result, the dictzip file can also be decompressed with gzip.
The full specification of the dictzip format can be found in the dictzip manual page. Note that there is an error in this documentation and implementation: it claims that the compression buffer must be at least 10% + 12 bytes larger than the input buffer. However, 0.1% + 12 bytes is sufficient (and corresponds to the zlib documentation).
For a Qt-based tool that we are currently working on, I wrote a QIODevice-derived class to read and write dictzip files. This functionality is now available as a library at: http://github.com/danieldk/qdictzip
Update (October 2012): This is now available as part of the alpinocorpus library (see DzIstream/DzOstream).