Subjects in Discrete Source Coding

Seminar, Spring 2005

 

Dafna Sheinwald,  dafna@il.ibm.com,  Tel. 8296477

We will survey several key papers on Discrete Source Coding, for Data Compression.

 

Shannon’s paper that established the field: Shannon1948  and other basic definitions and properties, including a very brief refresher to Probability Theory.  Lectures 1&2  (24.2,  3.3)

 

Popular introductions to many kinds of data compression algorithms  http://www.cs.tut.fi/~albert/Dev/pucrunch/packing.html  and  http://datacompression.info/  and more.

 

Subjects for Students Lectures (10.3 Huffman Omer), (17.3 LZ77 Natan), (24.3 Integer Ori), (31.3 BW Ilan), (7.4 Web, Yoni+Alex), (14.4 VF, Shahar Miki), [Pessach], (5.5 delta  Eran ), [Yom HaAtsmaut], (19.5 Grammar Matan), (26.5 arithmetic Alex), (2.6  LZ78 Uri), (9.6 PPM Tsahi)

 

Papers in the .Djv, or .DjVu format are read with a plug-in for internet browsers, free from: http://www.lizardtech.com/download/

 

(1) Huffman Coding: Algorithm, proof of optimality (from here, for example), canonical Huffman and what it is good for (from here, or here for example) --  you can find by yourself anywhere you like in the literature. Much appears on the internet. Expand a little on one subject from the following, or another that you find.

      (a) Parallel Huffman Decoding (Klein and Wiseman)

      (b) Self synchronization property. (e.g.  above Klein and Wiseman, or reference to results of Furgeson or Zeger )

      (c) Some interesting properties of Huffman Codes (Gallager)

      (d) How is Huffman Coding doing after many years. (Klein)    Omer (10.3).  His presentation.

 

(2) Integer Coding: infinite number of codewords, one codeword for each integer. Here is a nice introduction to them all, with thanks to UCF. Formally: Elias or Rodeh  (present the method, no need to get into proofs) and Golomb (method and when optimal). Then describe the Move To Front Algorithm that uses Elias. Here is a harder version to read of Move To Front, by Elias..   Ori  (24.3)

     

(3) Variable to Fixed codes for memoryless sources:  Tunstall Code for a known source (p. 2 of this paper)  and the generalization to one tree that fits all sources . The latter is based on  Lawrence  and  Schalkwijk.  The proofs in these are not very easy to follow.  The algorithms presented are simple. No need to go deep into proofs, just mention main ideas.  Shahar+Miki (14.4)  Their presentation.

 

(4) Ziv Lempel 1977. The award.  Possible extension: the DEFLATE standard (explained nicely here:  http://www.gzip.org/zlib/feldspar.html, and appears formally in many places, e.g.:  http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1951.html). The standard is based on LZSS paper by Storer and Szymanski, summarized nicely by Baden. Here is a nice, cult one implementation thereof by Ross Williams and here is his funny report of his presentation of this work..  Natan, (17.3)  His presentation.

 

(5) Ziv Lempel 1978. Possible extension: the Welch Version (called LZW, and discussed much in the internet, e.g. http://www.cs.cf.ac.uk/Dave/Multimedia/node214.html or http://dogma.net/markn/articles/lzw/lzw.htm). Another interesting extension of the Ziv-Lempel 1978 compression method: Variations on a Theme by Ziv and Lempel.  The idea goes into compression of  Graphs and Trees.  Uri (2.6)   His presentation.

 

(6) Arithmetic coding, theory and practice: CACM paper, which was explained by Vitter and then revisited . Explained in popular ways in many sites, including the two mentioned at the top of this page.       (26.5) Alex   His presentation

 

(7) Burrows and Wheeler Block Sorting Algorithm, the original research report and analyses and implementation   The latest word in data compression: sort and blend and mingle the sequence, and then it compresses better.. Here is another explanation and example of implementation, and here is another one.   Ilan (31.3)  His presentation.

 

(8) Prediction by Partial Matching (PPM):  at each point, look at the recent past, and find match with previous cases, and draw conclusions from these previous cases. The basics,   Unbounded PPM , PPM for English, PPM for software, practical PPM and a newer one (from DCC, Data Compression Conference, 2002), and a very recent news, DCC 2003: cleaning the model, which includes an explanation of the whole method. The probability of the yet unseen is an issue here. (No need to do all. Best is to start with the basics” and/or read the practical PPM, or the 2003 paper which is also an overview, and then extend with Unbounded PPM, or with any of the others).    Tsahi  (9.6)  His presentation

 

(9) Grammar Based methods:  Cameron (1988) with applications to Pascal; Lehman (2002) Finding THE grammar that generates the given string– intractable, and hence approximations. Nevill-Manning (1994) looking for formal structure in compressed text;    Context Free Languages -- the language understood by us.. rather difficult to digest all, present main ideas and the way entropy is computed based on derivation tree.  Matan (19.5)  His presentation.

 

(10) Chaitin – formal proof for complexity of sequences, also called algorithmic complexity, or Kolmogorov complexity.  Most sequences are random, but we can not formally prove for any given specific sequence that it is random.  http://www.umcs.maine.edu/~chaitin/ieee74.html with (exciting for philosophy lovers) background  from  http://www.umcs.maine.edu/~chaitin/    A survey by Vitanyi and Grunwald, also comparing with Shannon.   

 

(11) Compression of the Web graph, and several other tricks for that mission. An experiment with page re-ordering for better compression of Search Engine index.  (7.4) Alex and Yoni. Their presentation.

 

(12) Delta compression: compression of one file by references to similar files known to both sender and receiver.  xdelta and journal paper about it, zdelta and applications to file systems.   Tridgell describes his rsync very nicely, he employs delta ideas, in particular chapters 3 and 4.  Eran (5.5)  His presentation.