XML Data Compression

XML Data Compression

last update: Jan. 25, 2006

Ongoing work with Greg Leighton.

Short description:

Development of algorithms for compressing XML documents: AXECHOP (very good compression rate; can not query compressed documents) and TREECHOP (can query compressed documents).

Last update: March 25, 2005

Longer description:

One of the drawbacks of XML is that it is quite verbose: an XML representation of a set of data can easily be ten times as large as a more economical representation of the data.

AXECHOP uses grammar compression to further improve the compression of XML data files.

AXECHOP compression strategy is:

Separate structure from content
Apply MPM algorithm to the structure string
Compress data value lists
Write compressed data to file

The AXECHOP compressed file format has three sections:

the encoded grammar used to derive the structure string
the contents of the element and attribute tables
the compressed contents of data lists for individual elements and attributes, along with those of the comment list, CDATA list, processing instruction list, and DOCTYPE list.

AXECHOP decompression strategy is:

Reverse the effects of the MPM to obtain the original structure string and data lists
The original XML file can be reconstructed by the decoder by conducting a single pass through the reconstituted structure string.

TREECHOP supports querying of compressed XML data without requiring prior decompression.

TREECHOP compression strategy is:

Generate the document tree for the input XML document
Assign a codeword to each tree node
Write Node Encodings to Compression Stream

Format of Encoding Stream
The content of the prologue container is written to the stream, followed by the encodings for all tree nodes in depth-first order. Finally, the content of the epilogue container is written to the stream. As content is added to the stream, it is compressed using gzip.

TREECHOP decompression strategy is:

Since tree node encoding is written to the encoding stream in a depth-first order, it is possible for the decoder to regenerate the original XML document incrementally.
A code table is used to keep track of the codewords processed so far and the data type and text value for the tree node associated with each of these codewords.

Querying of Compressed Data

Querying of the compressed data is done by registering one or more filters with the decoder. Each filter has two components: a query string specified using XPath-like syntax, and a handler class which is responsible for processing each query match as it is returned by the decoder. For example, a filter used to find the names of all students in the example XML document would have the query string 'select /class/students/student'.
As the decoder processes the encoding stream, it will attempt to calculate the codeword that satisfies the filter's query predicate. Once the codeword has been discovered, each occurrence of this codeword in the stream results in a notification being sent to the handler associated with the filter, along with the data contained in the next encountered leaf node.
Handlers offer a flexible way to handle query results. A handler could be configured to perform a simple action (such as outputting the query results to console), or some more complex operation such as inserting query results into a database or generating a new XML document containing the query results.

Future Work:

investigation of validating compressed (using TREECHOP) XML files, using a compressed XML Schema
design of a specialized xml editor. When a new document will be created using this editor, the user interface will show the XML text as always, but internally it will maintain a "compressed" version of the document.
Compression of JXTA messages, using a "schema-aware" mode, where codewords were pre-calculated for each core JXTA message type instead of having node names being transmitted adaptively in the compression stream.
Design and provide sample implementations of applications of Treechop for SVG
Creating animations using TREECHOP-compressed SMIL

Xml cOmpression Framework, XOF

development of a framework, which provides compressed streams (similar to GZIPStreams), with dynamical plugins that can be used to specify a compression algorithm (using a factory/property)
development of an XMLFilter library, similar to XMLReader
investigation of validating compressed (using TREECHOP) XML files, using a compressed XML Schema
investigation of transforming compressed XML documents using compressed XSLT stylesheets
incorporation of TREECHOP into XFILTERs, specifically building a framework, in which filters could be specified using XPATH
development of a specialized xml editor. When a new document will be created using this editor, the user interface will show the XML text as always, but internally it will maintain a "compressed" version of the document.
fixing AXECHOP (2^k problem)
XML-specific grammar based compression and Using PPM rather than BW
variable size code

LINKS:

RSJ XML compression

Paper on grammar compression

Publications:

Using XML Compression for WWW Communication. T. Müldner, G. Leighton, J. Diamond
IADIS International Conference WWW/Internet 2005

TREECHOP: A Tree-based Query-able Compressor for XML
G. Leighton, T. Müldner, J. Diamond
The Ninth Canadian Workshop on Information Theory, Montreal, June 2005

AXECHOP: A Grammar-based Compressor for XML
G. Leighton, J. Diamond, T. Müldner.
Poster at DCC 2005

Links:

Seminar

Return