LZ77 compression algorithm

Compression algorithms are designed to reduce the storage requirements of a given block of data, which they achieve by identifying repeating sequences of bytes and re-encoding them with less redundancy. The compression function should work in a way that is reversible by a decompression algorithm.

The LZ77 compression algorithm reduces the redundancy of its input by identifying repeated prefix strings and replacing them with pointers to blocks within a byte buffer, usually known as a sliding window. The prefix pointers are encoded in a way that’s context dependant, so for each compression packet (describing a token and a prefix pointer), the state of the window during compression and decompression must be the same. The Sliding Window ensures its integrity by observing the following rules:

Only the token value gets appended to the sliding window.
Every generated packet has a token.
Tokens have a size of 1 byte.
Prefixes define offsets relative to the Sliding Window’s history buffer before (simplification) appending the current token.

I’ve created a diagram to show how LZ77 would go about compressing the string AABAAABAB. In order (packet 1, 2, 3 and 4), the sliding window contents would be [], [A], [A, B] and [A, B, A]. It would end with the contents [A, B, A, B]. The dictionary is equal to the last n tokens (sans-prefix), where n is the sliding window size in bytes.

A diagram showing that the string `AABAAABAB` compresses to `ABAA`

Cached dictionary

Example code

I’ve created an example implementation of LZ77 that can generate the compression packets for a given input.