CSC 499 S09: Readings
From CSWiki
A place to put all the useful and fun things you've read.
Our Bibtex file
Contains references to texts we used doing the course of the semester. http://www.cs.grinnell.edu/~leguiato/MAP499/biblio.bib
PAQ8
PAQ8 variants are every winner of the Hutter Prize. http://www.cs.fit.edu/~mmahoney/compression/#paq
PAQ8l is Matt Mahoney's latest version: http://www.cs.grinnell.edu/~leguiato/MAP499/paq8l.cpp
XML Compression Readings
Type-Based Compression of XML Data - Christopher League Long Island University Kenjone Eng Long Island University DCC archive Proceedings of the 2007 Data Compression Conference table of contents Pages 273-282 Year of Publication: 2007 ISBN ~ ISSN:1068-0314 , 0-7695-2791-4 Comparative Analysis of XML Compression Technologies - Wilfred Ng Department of Computer Science, The Hong Kong University of Science and Technology Wai-Yeung Lam Department of Computer Science, The Hong Kong University of Science and Technology James Cheng Department of Computer Science, The Hong Kong University of Science and Technology World Wide Web archive Volume 9 , Issue 1 (March 2006) table of contents Pages: 5 - 33 Year of Publication: 2006 ISSN:1386-145X
DMC Model: Original Source
http://plg1.cs.uwaterloo.ca/~ftp/dmc/dmc.c
PAQ1 Source Code
Theoretically, the first one should be simplest. http://www.cs.grinnell.edu/~leguiato/MAP499/paq1.cpp (Originally from http://www.cs.fit.edu/~mmahoney/compression/paq1.cpp)
Ian's thoughts
From the paper: Could we further optimize the probabilities by doing a "cache effect" check as Mahoney does on p.2? (Maybe later winners did this)
Lossless Compression Handbook
Khalid Sayood, ed. ISBN 0-12-629861 On ILL until 17 February 2009 (with possibility of renewal)
Approximately the first half of the book seems to focus on background (e.g. information theory) and general approaches. The latter portion of the book has chapters dedicated to specific applications of compression.
Chapter 1: Information Theory Behind Source Coding
About what you'd expect. Some equations may want copying, but this chapter covers area that is well represented in other texts.
Chapter 4: Huffman Coding
Covers basic Huffman Codes, plenty of math, and many variations/extensions. Some of these variations look promising. Provides bounds on information content, average length of codes, maximum length of codes and more.
Huffman Prefixed Codes
Applies to very large alphabets, divides symbols into equivalence classes with roughly the same probability, provides prefix codes for each class and indices to symbols within classes. Would it be possible to implement a code where every "word" (loosely speaking) was a symbol in the alphabet and we used a Huffman-for-huge-alphabets strategy to encode it? It seems like this starts to blur the line with dictionary-based encoding.
Extended Huffman Codes
Notes that Huffman codes get impractical on very large alphabets, shows why mathematically, and offers a pointer to Ziv-Lempel dictionary-based encoding (LZW). LZW is what runs gzip (not in the book, but useful).
Adaptive Huffman Coding
Rather than being determined in a pre-processing step and then remaining static, the code tree is built as the data is encoded and dynamically altered to reflect the input as it is read. Seems to happen in a single pass over the data. By having the decoder follow the same rules as the encoder, the decoder is able to implicitly reconstruct the tree from the encoded data so explicitly storing a full representation of the tree is not necessary. (Note: much of that summary is not directly from this source.) The book notes that AHC is commonly used in telecommunication where scanning a corpus before encoding is impossible since the corpus is being generated and must be transmitted in real time.
Brute Force Adaptive Huffman
Options for brute-force AHC (my acronym): recompute tree after each symbol is read (bad), recompute tree after n symbols are read (better), maintain a list of symbols indexed by frequency that is effectively bubble-sorted in realtime, recomputing tree when a symbol's ranking changes (better still, but not as efficient as subsequent algorithms).
The Faller, Gallager, and Knuth (FGK) Algorithm
It has Knuth it its title, so it must be good. Method of updating the tree adaptively. Chief disadvantage is that it changes only in response to what it reads, cannot capture trends in the data and converges slowly on specific characteristics of the data rather than being able to know them from the start. Knuth's variant allows for new characters to be added in the stream through the use of an escape character and zero-frequency placeholder node in the evolving tree.
Vitter's Algorithm: Algorithm Λ
Variant of FGK that encodes a sequence of length s in no more than s bits more than a static Huffman, and will have the tree built in to that transmission length. (So, Λ would be provably more efficient than static Huffman coding in any case where the number of bits required to store the code tree is greater than the length of the sequence (measured in symbols, I think).)
Other Adaptive Huffman Coding Algorithms
I was interested by mention of the Pigeon & Bengio's algorithm M which is described as showing promise in encoding Unicode files more efficiently than Λ. Like #Huffman Prefixed Codes, M provides codes for an equivalence class of symbols which are then accessed by an index.
An Observation on Adaptive Algorithims
To produce an efficient code, an adaptive algorithm should observe a sequence many times longer than the size of the alphabet. On sequences that are short relative to the alphabet length, a static code should be better. There is also a suggestion of prefix code where the codes for each equivalence class are updated adaptively.
Efficient Implementations
Basically, memory-efficiency and time-efficiency are mutually exclusive. We're bounded on both for the contest. It also seems to be that extreme time efficiency could translate into such bad memory efficiency that we get in to locality issues, paging etc. thus ruining our time efficiency.
Memory-Efficient Algorithms
Trees are sparse, therefore they waste a lot of memory. There are tricks to replace the tree with tables, but learning their details would require following the citations.
Speed-Efficient Algorithms
One can use an FSA to work on more than 1 bit at a time, states represent all possible interpretations of a bit of given length, and current state and transitions parse out which are possible interpretations given the context. It is also suggested that the binary tree may be stored as a quaternary tree, thereby reducing its depth as well as the number of (all empty) internal nodes. But, these ideas only work on static codes, no ideas are given for fast decoding of AHCs.

