1 | 1. Compression algorithm (deflate)
|
---|
2 |
|
---|
3 | The deflation algorithm used by gzip (also zip and zlib) is a variation of
|
---|
4 | LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
|
---|
5 | the input data. The second occurrence of a string is replaced by a
|
---|
6 | pointer to the previous string, in the form of a pair (distance,
|
---|
7 | length). Distances are limited to 32K bytes, and lengths are limited
|
---|
8 | to 258 bytes. When a string does not occur anywhere in the previous
|
---|
9 | 32K bytes, it is emitted as a sequence of literal bytes. (In this
|
---|
10 | description, `string' must be taken as an arbitrary sequence of bytes,
|
---|
11 | and is not restricted to printable characters.)
|
---|
12 |
|
---|
13 | Literals or match lengths are compressed with one Huffman tree, and
|
---|
14 | match distances are compressed with another tree. The trees are stored
|
---|
15 | in a compact form at the start of each block. The blocks can have any
|
---|
16 | size (except that the compressed data for one block must fit in
|
---|
17 | available memory). A block is terminated when deflate() determines that
|
---|
18 | it would be useful to start another block with fresh trees. (This is
|
---|
19 | somewhat similar to the behavior of LZW-based _compress_.)
|
---|
20 |
|
---|
21 | Duplicated strings are found using a hash table. All input strings of
|
---|
22 | length 3 are inserted in the hash table. A hash index is computed for
|
---|
|
---|