CS20c Project Final Report (5-28-98)
Eagle Jones

## Wavelet Audio Compression

### Introduction

Audio compression has many important applications in broadcast, home entertainment, and communications [1,2]. The current standard for compressed audio, MPEG, is based on the Discrete Cosine Transform (DCT) [3]. The wavelet transform has proved superior to the DCT in areas including image compression [4]. The wavelet transform is also fast: it runs in O(n) time. Using lifting, the data can be transformed in place to reduce memory usage, and the transform can map integers to integers, allowing for perfect reconstruction [5].

### What are Wavelets?

The term wavelet refers to a family of functions that have been used in signal processing and studied in pure mathematics for many years. The name reflects their two defining properties. First, they "waver" above and below the x-axis, integrating to zero. Second, they are "function-lets," which vanish outside of a given interval. Additional properties, such as smoothness, self-similarity, and compact support, may also be built into particular wavelets [6]. Figure 1 shows examples of four wavelets from the Daubechies family.

Figure 1. Four wavelets from the Daubechies family. From [6].

Wavelets can also form bases for function spaces, allowing them to be used in a manner similar to the sine and cosine functions of the Fast Fourier Transform and the Discrete Cosine Transform. There are analogs in the domain of wavelets: the Discrete Wavelet Transform (DWT) and the Fast Wavelet Transform (FWT). The FWT is actually faster than the FFT: it runs in O(n) rather than O(n log n) time. The wavelet transform also has an advantage over the Fourier Transform in that it can look at the time and frequency domains somewhat simultaneously.

The wavelet transform works by splitting a signal s0[i] (the 0th-level signal, with i elements) into a new signal s1[i/2] and a detail component d1[i/2]. The new signal s1 is a half-resolution representation of the original and contains the bottom half of the original's frequency range; the detail d1 encodes the upper half. We can then repeat this process on s1, and so on, until we are left with just one element in the signal: the lowest-frequency, "DC" component of the original. The actual function used to decompose a signal into its half-resolution components is determined by our selection of the wavelet.
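The recursive halving described above can be sketched in a few lines of Java. This is only a minimal illustration using a plain Haar average/difference step; the class and method names are mine, not the classes from this project.

```java
// Sketch of the recursive structure of the wavelet transform: each pass
// halves the "signal" part and produces a same-sized "detail" part.
public class MultiLevelSplit {
    // One split of a[0..n): s[k] = average of a pair, d[k] = their difference.
    static void split(int[] a, int n) {
        int[] tmp = new int[n];
        for (int k = 0; k < n / 2; k++) {
            tmp[k]         = (a[2 * k] + a[2 * k + 1]) / 2; // signal s
            tmp[n / 2 + k] = a[2 * k] - a[2 * k + 1];       // detail d
        }
        System.arraycopy(tmp, 0, a, 0, n);
    }

    public static void main(String[] args) {
        int[] a = {10, 12, 14, 12, 10, 8, 6, 8};
        // Repeat on the signal half until one element remains: that
        // element is the lowest-frequency, "DC" component.
        for (int n = a.length; n > 1; n /= 2) split(a, n);
        System.out.println("DC component: " + a[0]); // prints 10, the overall average
    }
}
```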

Lifting is a scheme for generating wavelet transforms by starting with an existing transform and "lifting" it by adding additional steps. The first step begins with a "lazy" wavelet, one which simply splits the input into even and odd components; these become our signal s and detail d components. While this step doesn't actually do much, we can add steps which make it useful. For example, the Haar transform can be calculated by setting d[j] = d[j] - s[j], then s[j] = s[j] + d[j] / 2. This effectively predicts that the odd values will be equal to the even values; that is, when the signal is constant, the wavelet detail coefficients will be zero. We can choose other predictors as well: linear prediction, cubic polynomials, B-splines, etc. With the lifting scheme, all computation can be done in place, reducing memory requirements. We can also compute many of the transforms as integer-to-integer transforms, allowing for perfect reconstruction.
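The Haar lifting steps above translate directly into in-place, integer-to-integer Java code. This is a minimal sketch, not the project's Wavelet classes; the key point is that the inverse runs the same lifting steps backwards with opposite signs, and since the integer division d[j]/2 is computed identically in both directions, reconstruction is exact.

```java
public class LiftingHaar {
    // One level of the Haar transform via lifting, computed in place on
    // interleaved data: even indices become the signal s, odd indices
    // become the detail d.
    static void forward(int[] a) {
        for (int j = 0; j + 1 < a.length; j += 2) {
            a[j + 1] -= a[j];         // predict: d[j] = d[j] - s[j]
            a[j]     += a[j + 1] / 2; // update:  s[j] = s[j] + d[j] / 2
        }
    }

    // Undo the lifting steps in reverse order with flipped signs;
    // the same truncating division makes the inverse exact.
    static void inverse(int[] a) {
        for (int j = 0; j + 1 < a.length; j += 2) {
            a[j]     -= a[j + 1] / 2;
            a[j + 1] += a[j];
        }
    }

    public static void main(String[] args) {
        int[] a = {7, 7, 7, 7, 3, 5, 9, 1};
        int[] original = a.clone();
        forward(a);
        // Constant-valued pairs produce zero detail coefficients:
        System.out.println(java.util.Arrays.toString(a)); // [7, 0, 7, 0, 4, 2, 5, -8]
        inverse(a);
        System.out.println(java.util.Arrays.equals(a, original)); // true: perfect reconstruction
    }
}
```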

Compression is achieved because, if the signal is correlated in a way that matches our filter, most of the detail coefficients will be zero and the rest will be clustered around zero in a roughly Gaussian distribution. We can take advantage of this distribution with statistical coding, and we can quantize the coefficients to get even better compression. Additionally, higher frequencies are perceptually less important than lower frequencies (in images, and sometimes in audio), so we can apply extra quantization to the higher-frequency coefficients.

### Implementation

I originally planned to use Apple's Audio Interchange File Format (AIFF) to read and write files, but when I actually implemented the code for it, I ran into a problem: AIFF stores the sample rate as an extended real (80-bit IEEE floating-point) number, while Java only supports float (32-bit) and double (64-bit). Reading and writing AIFF files would have meant implementing bit-wise conversions between extended and double. Further consideration led me to use Microsoft's WAVE file format [7]. In practice, it is quite similar to AIFF, so the code I had already written could be used with little modification. I implemented the classes WAVInputStream and WAVOutputStream to read and write WAV files; they are derived from FilterInputStream and FilterOutputStream, respectively.

I implemented an integer-to-integer wavelet transform based on the lifting scheme. This was actually fairly easy once I really understood it. I designed an interface called Wavelet, then implemented it in HaarWavelet, FourFourWavelet, and FourTwoWavelet; the latter two are derived from [8]. I decided that dynamic Huffman coding would be a good way to compress the data [9]. I wrote BitOutputStream and BitInputStream to read and write a bitstream. After verifying that these worked, I built HuffmanOutputStream and HuffmanInputStream on top of them. I put the quantizer and the interface between the wavelet representation and the Huffman coding in the class WaveletCompression.

To actually encode a file, a WAVInputStream is created, and read into a buffer. The wavelet transform uses 32-bit signed integers internally to avoid overflow, so standard 16-bit audio has to be converted. Then, the FWT method of a Wavelet class is called, specifying the data and the depth of the transform (number of times the wavelet transform is applied to the signal). Finally, the compress method of WaveletCompression is called, passing the data, an array of bit lengths, the depth of the wavelet transform, and an output stream. This quantizes the data in each level to the bit depth specified and encodes with the HuffmanOutputStream. Decoding is a simple matter of calling the decompress method of WaveletCompression, the IFWT method of a Wavelet class, and writing the file with WAVOutputStream. All this is implemented in my simple test program Encode.

The most difficult part of this project was understanding exactly how the wavelet transform worked and how a bug or a change in the code would affect my output. The interactions between the wavelet transform, the quantizer, and the encoding are complex, so it was often difficult to decide what to examine when a file came out too large or didn't sound right. One surprise was that the left and right channels of a stereo signal are less correlated than I had surmised. I originally planned to use a Haar wavelet to analyze the stereo signal, but the output was very poor. I analyzed the stereo differences (L-R) and found much larger differences than I expected, so I ended up working with mono files.

I implemented my code so that it should be fairly extensible. One aspect of this is the quantizer: rather than actually dividing the value to be coded by a quantization level, I simply mask out the low bits and store the result as a 16-bit integer. The Huffman coding minimizes the impact of this technique on file size (typically less than a 1 kB difference), and it allows for non-uniform quantization, or quantization that varies over time as different frequencies become more or less important. This will ease the implementation of a perceptual model for the codec.
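The bit-masking quantizer described above can be sketched as follows. This is a minimal illustration; the class and method names are mine, not the project's WaveletCompression class.

```java
public class MaskQuantizer {
    // Quantize by zeroing the low `bits` bits instead of dividing by a
    // quantization level. The value keeps its original scale, so the
    // entropy coder downstream sees small, repetitive low-order symbols,
    // and the number of dropped bits can vary per coefficient or per
    // subband. Note: on negative two's-complement values this rounds
    // toward negative infinity.
    static int quantize(int value, int bits) {
        return value & ~((1 << bits) - 1);
    }

    public static void main(String[] args) {
        System.out.println(quantize(12345, 4)); // low 4 bits dropped: 12336
        System.out.println(quantize(12345, 0)); // no quantization: 12345
        System.out.println(quantize(-7, 2));    // rounds down: -8
    }
}
```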

### Results

I am fairly pleased with the results that I achieved. All of these WAV files are 44.1 kHz, 16-bit, mono, and large (about 1.6 MB each). The clip is 19.39 seconds long, an excerpt from "Stayin' Alive" by the Bee Gees. The original was copied digitally straight from the remastered Saturday Night Fever soundtrack.

### References

- [4] "Wavelets offer better-than-JPEG compression at low cost." http://206.168.2.242/Editorial/1996/01/Focus/wavelets.html
- Audio file formats. http://www.mediatel.lu/workshop/audio/fileformat/h_format.html
- Clearer description of the WAVE file format. http://www.lightlink.com/tjweber/StripWav/WAVE.html
- wavelet.org. http://www.wavelet.org
- Wim Sweldens' papers on the lifting scheme. ftp://cm.bell-labs.com/who/wim/papers/papers.html