We will
survey several key papers on Discrete Source Coding, for Data Compression.
Shannon’s
paper that established the field: Shannon1948 and other basic definitions and properties,
including a very brief refresher to Probability Theory. Lectures
1&2 (24.2, 3.3)
Popular
introductions to many kinds of data compression algorithms http://www.cs.tut.fi/~albert/Dev/pucrunch/packing.html and http://datacompression.info/ and more.
Subjects for
Students Lectures (10.3 Huffman Omer), (17.3 LZ77 Natan), (24.3 Integer Ori), (31.3
BW Ilan), (7.4 Web, Yoni+Alex), (14.4 VF, Shahar Miki), [Pessach], (5.5
delta Eran ), [Yom HaAtsmaut], (19.5
Grammar Matan), (26.5 arithmetic Alex), (2.6
LZ78 Uri), (9.6 PPM Tsahi)
Papers in the .Djv, or .DjVu format are read with a
plug-in for internet browsers, free from: http://www.lizardtech.com/download/
(1) Huffman
Coding: Algorithm, proof of optimality (from
here, for example), canonical Huffman and what it is good for (from here, or
here for example)
-- you can find by yourself anywhere you
like in the literature. Much appears on the internet. Expand a little on one
subject from the following, or another that you find.
(a) Parallel
Huffman Decoding (Klein and Wiseman)
(b) Self
synchronization property. (e.g. above Klein and Wiseman, or reference to
results of Furgeson or
Zeger )
(c) Some
interesting properties of Huffman Codes (Gallager)
(d) How is
Huffman Coding doing after many years. (Klein) Omer (10.3). His
presentation.
(2) Integer Coding: infinite number of codewords, one
codeword for each integer. Here is a nice introduction
to them all, with thanks to UCF. Formally: Elias
or Rodeh
(present the method, no need to get into proofs) and Golomb (method and when optimal). Then
describe the Move
To Front Algorithm that uses Elias. Here is a harder version to read of
Move To Front, by Elias.. Ori
(24.3)
(3) Variable to Fixed codes for memoryless
sources: Tunstall Code for a known
source (p. 2 of this paper) and the generalization to one tree that fits all sources . The
latter is based on Lawrence and Schalkwijk.
The proofs in these are not very easy to follow. The algorithms presented are simple. No need
to go deep into proofs, just mention main ideas. Shahar+Miki (14.4) Their
presentation.
(4) Ziv Lempel 1977. The award.
Possible extension: the DEFLATE standard (explained nicely here: http://www.gzip.org/zlib/feldspar.html,
and appears formally in many places, e.g.:
http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1951.html).
The standard is based on LZSS paper by Storer and Szymanski, summarized nicely
by Baden. Here is a
nice, cult one implementation thereof by Ross
Williams and here
is his funny report of his presentation of this work.. Natan, (17.3)
His presentation.
(5) Ziv Lempel 1978.
Possible extension: the Welch Version
(called LZW, and discussed much in the internet, e.g. http://www.cs.cf.ac.uk/Dave/Multimedia/node214.html
or http://dogma.net/markn/articles/lzw/lzw.htm).
Another interesting extension of the Ziv-Lempel 1978 compression method: Variations on a Theme by Ziv and
Lempel. The idea goes into
compression of Graphs and Trees. Uri (2.6)
His presentation.
(6) Arithmetic coding, theory and practice: CACM paper, which was explained by Vitter and then revisited .
Explained in popular ways in many sites, including the two mentioned at the top
of this page. (26.5) Alex His
presentation
(7) Burrows and
Wheeler Block Sorting Algorithm, the original research report and analyses and
implementation The latest word in
data compression: sort and blend and mingle the sequence, and then it
compresses better.. Here is another explanation and
example of implementation, and here is another one. Ilan (31.3)
His presentation.
(8) Prediction by Partial Matching (PPM): at each point, look at the recent past, and
find match with previous cases, and draw conclusions from these previous cases.
The basics, Unbounded
PPM , PPM for English, PPM for software, practical PPM and a newer one
(from DCC, Data Compression Conference, 2002), and a very recent news, DCC 2003: cleaning the
model, which includes an explanation of the whole method. The probability of the yet unseen is an
issue here. (No need to do all. Best is to start with the basics” and/or
read the practical PPM, or the 2003 paper which is also an overview, and then
extend with Unbounded PPM, or with any of the others). Tsahi
(9.6) His presentation
(9) Grammar Based methods: Cameron
(1988) with applications to Pascal; Lehman
(2002) Finding THE grammar that generates the given string– intractable,
and hence approximations. Nevill-Manning (1994) looking for formal structure in compressed
text; Context Free Languages -- the language
understood by us.. rather difficult to digest all, present main ideas and the
way entropy is computed based on derivation tree. Matan (19.5)
His presentation.
(10) Chaitin – formal proof for complexity of
sequences, also called algorithmic complexity, or Kolmogorov complexity. Most sequences are random, but we can not
formally prove for any given specific sequence that it is random. http://www.umcs.maine.edu/~chaitin/ieee74.html
with (exciting for philosophy lovers) background from http://www.umcs.maine.edu/~chaitin/ A survey by Vitanyi and
Grunwald, also comparing with Shannon.
(11) Compression of the Web
graph, and several other tricks for that mission. An experiment
with page
re-ordering for better compression of Search Engine index. (7.4) Alex and Yoni. Their presentation.
(12) Delta compression: compression of one file by
references to similar files known to both sender and receiver. xdelta
and journal paper about it, zdelta and applications
to file systems. Tridgell describes
his rsync very nicely, he
employs delta ideas, in particular chapters 3 and 4. Eran (5.5)
His presentation.