Programming Assignment 5: Burrows–Wheeler Data Compression

Note that, as of November 2017, the names of the methods in BurrowsWheeler changed from encode() and decode() to transform() and inverseTransform(), respectively.

Implement the Burrows–Wheeler data compression algorithm. This revolutionary algorithm outcompresses gzip and PKZIP, is relatively easy to implement, and is not protected by any patents. It forms the basis of the Unix compression utility bzip2.

The Burrows–Wheeler compression algorithm consists of three algorithmic components, which are applied in succession:

  1. Burrows–Wheeler transform. Given a typical English text file, transform it into a text file in which sequences of the same character occur near each other many times.

  2. Move-to-front encoding. Given a text file in which sequences of the same character occur near each other many times, convert it into a text file in which certain characters appear more frequently than others.

  3. Huffman compression. Given a text file in which certain characters appear more frequently than others, compress it by encoding frequently occurring characters with short codewords and infrequently occurring characters with long codewords.
Step 3 is the one that compresses the message: it is particularly effective because Steps 1 and 2 produce a text file in which certain characters appear much more frequently than others. To expand a message, apply the inverse operations in reverse order: first apply the Huffman expansion, then the move-to-front decoding, and finally the inverse Burrows–Wheeler transform. Your task is to implement the Burrows–Wheeler and move-to-front components.

Binary input and binary output. To enable your programs to work with binary data, you will use BinaryStdIn and BinaryStdOut, which are described in Algorithms, 4th edition. To display the binary output when debugging, you can use HexDump, which takes a command-line argument n, reads bytes from standard input and writes them to standard output in hexadecimal, n per line.

% more abra.txt
ABRACADABRA!

% java-algs4 edu.princeton.cs.algs4.HexDump 16 < abra.txt
41 42 52 41 43 41 44 41 42 52 41 21
96 bits
Note that in ASCII, 'A' is 41 (hex) and '!' is 21 (hex).

Huffman compression and expansion. Huffman (Program 5.10 in Algorithms, 4th edition) implements the classic Huffman compression and expansion algorithms.

% java-algs4 edu.princeton.cs.algs4.Huffman - < abra.txt | java-algs4 edu.princeton.cs.algs4.HexDump 16
50 4a 22 43 43 54 a8 40 00 00 01 8f 96 8f 94
120 bits
% java-algs4 edu.princeton.cs.algs4.Huffman - < abra.txt | java-algs4 edu.princeton.cs.algs4.Huffman +
ABRACADABRA!
You will not write any code for this step.

Move-to-front encoding and decoding. The main idea of move-to-front encoding is to maintain an ordered sequence of the characters in the alphabet, and repeatedly read a character from the input message, print out the position in which that character appears, and move that character to the front of the sequence. As a simple example, if the initial ordering over a 6-character alphabet is A B C D E F, and we want to encode the input CAAABCCCACCF, then we would update the move-to-front sequence as follows:

move-to-front    in   out
-------------    ---  ---
 A B C D E F      C    2 
 C A B D E F      A    1
 A C B D E F      A    0
 A C B D E F      A    0
 A C B D E F      B    2
 B A C D E F      C    2
 C B A D E F      C    0
 C B A D E F      C    0
 C B A D E F      A    2
 A C B D E F      C    1
 C A B D E F      C    0
 C A B D E F      F    5
 F C A B D E  
If the same character occurs next to each other many times in the input, then many of the output values will be small integers, such as 0, 1, and 2. This extremely high frequency of certain characters makes an ideal scenario for Huffman coding. Name your program MoveToFront.java and organize it using the following API:
public class MoveToFront {
    // apply move-to-front encoding, reading from standard input and writing to standard output
    public static void encode()

    // apply move-to-front decoding, reading from standard input and writing to standard output
    public static void decode()

    // if args[0] is '-', apply move-to-front encoding
    // if args[0] is '+', apply move-to-front decoding
    public static void main(String[] args)
}

Performance requirements.   The running time of both move-to-front encoding and decoding must be proportional to n R (or better) in the worst case and proportional to n + R (or better) in practice on inputs that arise when compressing typical English text, where n is the number of characters in the input and R is the alphabet size. The amount of memory used by both move-to-front encoding and decoding must be proportional to n + R (or better) in the worst case.

Circular suffix array. To efficiently implement the key component in the Burrows–Wheeler transform, you will use a fundamental data structure known as the circular suffix array, which describes the abstraction of a sorted array of the n circular suffixes of a string of length n. As an example, consider the string "ABRACADABRA!" of length 12. The table below shows its 12 circular suffixes and the result of sorting them.

 i       Original Suffixes           Sorted Suffixes         index[i]
--    -----------------------     -----------------------    --------
 0    A B R A C A D A B R A !     ! A B R A C A D A B R A    11
 1    B R A C A D A B R A ! A     A ! A B R A C A D A B R    10
 2    R A C A D A B R A ! A B     A B R A ! A B R A C A D    7
 3    A C A D A B R A ! A B R     A B R A C A D A B R A !    0
 4    C A D A B R A ! A B R A     A C A D A B R A ! A B R    3
 5    A D A B R A ! A B R A C     A D A B R A ! A B R A C    5
 6    D A B R A ! A B R A C A     B R A ! A B R A C A D A    8
 7    A B R A ! A B R A C A D     B R A C A D A B R A ! A    1
 8    B R A ! A B R A C A D A     C A D A B R A ! A B R A    4
 9    R A ! A B R A C A D A B     D A B R A ! A B R A C A    6
10    A ! A B R A C A D A B R     R A ! A B R A C A D A B    9
11    ! A B R A C A D A B R A     R A C A D A B R A ! A B    2
We define index[i] to be the index of the original suffix that appears ith in the sorted array. For example, index[11] = 2 means that the 2nd original suffix appears 11th in the sorted order (i.e., last alphabetically).

Your job is to implement the following circular suffix array API, which provides the client access to the index[] values:

public class CircularSuffixArray {
   public CircularSuffixArray(String s)    // circular suffix array of s
   public int length()                     // length of s
   public int index(int i)                 // returns index of ith sorted suffix
   public static void main(String[] args)  // unit testing (required)
}
Corner cases.   Throw a java.lang.IllegalArgumentException in the constructor if the argument is null. Throw a java.lang.IllegalArgumentException in the method index() if i is outside ts prescribed range (between 0 and n − 1).

Unit testing.  Your main() method must call each public method directly and help verify that they work as prescribed (e.g., by printing results to standard output).

Performance requirements.   On typical English text, your data type must use space proportional to n + R (or better); the constructor must take time proportional to n log n (or better) on typical English text; and the methods length() and index() must take constant time. Warning: beginning with Java 7, Update 6, the substring() method takes time and space proportional to the length of the substring—in other words, you cannot form the n circular suffixes explicitly because that would take both quadratic time and space.

Burrows–Wheeler transform. The goal of the Burrows–Wheeler transform is not to compress a message, but rather to transform it into a form that is more amenable to compression. The transform rearranges the characters in the input so that there are lots of clusters with repeated characters, but in such a way that it is still possible to recover the original input. It relies on the following intuition: if you see the letters hen in English text, then most of the time the letter preceding it is t or w. If you could somehow group all such preceding letters together (mostly t’s and some w’s), then you would have an easy opportunity for data compression.

Name your program BurrowsWheeler.java and organize it using the following API:
public class BurrowsWheeler {
    // apply Burrows-Wheeler transform, reading from standard input and writing to standard output
    public static void transform()

    // apply Burrows-Wheeler inverse transform, reading from standard input and writing to standard output
    public static void inverseTransform()

    // if args[0] is '-', apply Burrows-Wheeler transform
    // if args[0] is '+', apply Burrows-Wheeler inverse transform
    public static void main(String[] args)
}
Performance requirements.   The running time of your Burrows–Wheeler transform must be proportional to n + R (or better) in the worst case, excluding the time to construct the circular suffix array. The running time of your Burrows–Wheeler inverse transform must be proportional to n + R (or better) in the worst case. The amount of memory used by both the Burrows–Wheeler transform and inverse transform must be proportional to n + R (or better) in the worst case.

Analysis (optional). Once you have MoveToFront.java and BurrowsWheeler.java working, compress some text files. Then, test it on some binary files. Calculate the compression ratio achieved for each file and report the time to compress and expand each file. (Here, compression and expansion consists of applying BurrowsWheeler, MoveToFront, and Huffman in succession.) Finally, determine the order of growth of the running time of each of your methods, both in the worst case and on typical English text inputs.

Deliverables. Submit MoveToFront.java, BurrowsWheeler.java, CircularSuffixArray.java, any other supporting files (excluding algs4.jar). You may not call any library functions other those in java.lang, java.util, and algs4.jar.

This assignment was developed by Kevin Wayne.
Copyright © 2004.