XML Data Compression

last update: Jan. 25, 2006

Ongoing work with Greg Leighton.

Short description

Longer description

Future Work

Publications



Short description:

Development of algorithms for compressing XML documents: AXECHOP (very good compression rate; can not query compressed documents) and TREECHOP (can query compressed documents).

Last update: March 25, 2005


Longer description:

One of the drawbacks of XML is that it is quite verbose: an XML representation of a set of data can easily be ten times as large as a more economical representation of the data.


AXECHOP uses grammar compression to further improve the compression of XML data files.

AXECHOP compression strategy is:

  1. Separate structure from content
  2. Apply MPM algorithm to the structure string
  3. Compress data value lists
  4. Write compressed data to file

The AXECHOP compressed file format has three sections:

AXECHOP decompression strategy is:

  1. Reverse the effects of the MPM to obtain the original structure string and data lists
  2. The original XML file can be reconstructed by the decoder by conducting a single pass through the reconstituted structure string.

TREECHOP supports querying of compressed XML data without requiring prior decompression.

TREECHOP compression strategy is:

  1. Generate the document tree for the input XML document
  2. Assign a codeword to each tree node
  3. Write Node Encodings to Compression Stream

Format of Encoding Stream
The content of the prologue container is written to the stream, followed by the encodings for all tree nodes in depth-first order. Finally, the content of the epilogue container is written to the stream. As content is added to the stream, it is compressed using gzip.

TREECHOP decompression strategy is:

  1. Since tree node encoding is written to the encoding stream in a depth-first order, it is possible for the decoder to regenerate the original XML document incrementally.
  2. A code table is used to keep track of the codewords processed so far and the data type and text value for the tree node associated with each of these codewords.

Querying of Compressed Data

  1. Querying of the compressed data is done by registering one or more filters with the decoder. Each filter has two components: a query string specified using XPath-like syntax, and a handler class which is responsible for processing each query match as it is returned by the decoder. For example, a filter used to find the names of all students in the example XML document would have the query string 'select /class/students/student'.
  2. As the decoder processes the encoding stream, it will attempt to calculate the codeword that satisfies the filter's query predicate. Once the codeword has been discovered, each occurrence of this codeword in the stream results in a notification being sent to the handler associated with the filter, along with the data contained in the next encountered leaf node.
  3. Handlers offer a flexible way to handle query results. A handler could be configured to perform a simple action (such as outputting the query results to console), or some more complex operation such as inserting query results into a database or generating a new XML document containing the query results.


Future Work:

Xml cOmpression Framework, XOF



LINKS:

RSJ XML compression

Paper on grammar compression


Publications:

Using XML Compression for WWW Communication. T. Müldner, G. Leighton, J. Diamond
IADIS International Conference WWW/Internet 2005

TREECHOP: A Tree-based Query-able Compressor for XML
G. Leighton, T. Müldner, J. Diamond
The Ninth Canadian Workshop on Information Theory, Montreal, June 2005

AXECHOP: A Grammar-based Compressor for XML
G. Leighton, J. Diamond, T. Müldner.
Poster at DCC 2005


Links:

Seminar


Return