XML Data Compression
last update: Jan. 25, 2006
Ongoing work with Greg Leighton.
Short description
Longer description
Future Work
Publications
Short description:
Development of algorithms for compressing XML documents: AXECHOP (very good compression rate;
can not query compressed documents) and TREECHOP (can query compressed documents).
Last update: March 25, 2005
Longer description:
One of the drawbacks of XML is that it is quite verbose: an XML representation of a set of data
can easily be ten times as large as a more economical representation of the data.
AXECHOP uses grammar compression to further improve the compression
of XML data files.
AXECHOP compression strategy is:
- Separate structure from content
- Apply MPM algorithm to the structure string
- Compress data value lists
- Write compressed data to file
The AXECHOP compressed file format has three sections:
- the encoded grammar used to derive the structure string
- the contents of the element and attribute tables
- the compressed contents of data lists for individual elements and attributes, along with those
of the comment list, CDATA list, processing instruction list, and DOCTYPE list.
AXECHOP decompression strategy is:
- Reverse the effects of the MPM to obtain the original structure string and data lists
- The original XML file can be reconstructed by the decoder by conducting a single pass through
the reconstituted structure string.
TREECHOP supports querying of compressed XML data without requiring
prior decompression.
TREECHOP compression strategy is:
- Generate the document tree for the input XML document
- Assign a codeword to each tree node
- Write Node Encodings to Compression Stream
Format of Encoding Stream
The content of the prologue container is written to the stream, followed by the encodings for all tree nodes in
depth-first order. Finally, the content of the epilogue container is written to the stream. As content is added
to the stream, it is compressed using gzip.
TREECHOP decompression strategy is:
- Since tree node encoding is written to the encoding stream in a depth-first order, it is possible
for the decoder to regenerate the original XML document incrementally.
- A code table is used to keep track of the codewords processed so far and the data type and text
value for the tree node associated with each of these codewords.
Querying of Compressed Data
- Querying of the compressed data is done by registering one or more filters with the decoder.
Each filter has two components: a query string specified using XPath-like syntax, and a handler class which is
responsible for processing each query match as it is returned by the decoder. For example, a filter used to find
the names of all students in the example XML document would have the query string 'select /class/students/student'.
- As the decoder processes the encoding stream, it will attempt to calculate the codeword that
satisfies the filter's query predicate. Once the codeword has been discovered, each occurrence of this codeword
in the stream results in a notification being sent to the handler associated with the filter, along with the data
contained in the next encountered leaf node.
- Handlers offer a flexible way to handle query results. A handler could be configured to perform
a simple action (such as outputting the query results to console), or some more complex operation such as inserting
query results into a database or generating a new XML document containing the query results.
Future Work:
- investigation of validating compressed (using TREECHOP) XML files, using a compressed XML Schema
- design of a specialized xml editor. When a new document will be created using this editor, the
user interface will show the XML text as always, but internally it will maintain a "compressed" version
of the document.
- Compression of JXTA messages, using a "schema-aware" mode, where codewords were pre-calculated
for each core JXTA message type instead of having node names being transmitted adaptively in the compression stream.
- Design and provide sample implementations of applications of Treechop for SVG
- Creating animations using TREECHOP-compressed SMIL
Xml cOmpression Framework, XOF
- development of a framework, which provides compressed streams (similar to GZIPStreams), with
dynamical plugins that can be used to specify a compression algorithm (using a factory/property)
- development of an XMLFilter library, similar to XMLReader
- investigation of validating compressed (using TREECHOP) XML files, using a compressed XML Schema
- investigation of transforming compressed XML documents using compressed XSLT stylesheets
- incorporation of TREECHOP into XFILTERs, specifically building a framework, in which filters
could be specified using XPATH
- development of a specialized xml editor. When a new document will be created using this editor,
the user interface will show the XML text as always, but internally it will maintain a "compressed" version
of the document.
- fixing AXECHOP (2^k problem)
- XML-specific grammar based compression and Using PPM rather than BW
- variable size code
LINKS:
RSJ XML compression
Paper on grammar compression
Publications:
Using XML Compression for WWW Communication. T. Müldner, G. Leighton, J. Diamond
IADIS International Conference WWW/Internet 2005
TREECHOP: A Tree-based Query-able Compressor for XML
G. Leighton, T. Müldner, J. Diamond
The Ninth Canadian Workshop on Information Theory, Montreal, June 2005
AXECHOP: A Grammar-based Compressor for XML
G. Leighton, J. Diamond, T. Müldner.
Poster at DCC 2005
Links:
Seminar
Return