Matt Timmermans, October 26, 2000
Bicom is a data compressor in the PPM family. It is freely available and open source. It compresses well, though not as well as the best PPM-style compressors (it will get better), and slowly, though not as slowly as the best PPM compressors (and it will get faster).
It's most unique characteristic, however, is that compression with bicom is completely bijective -- any file is a possible bicom output that can be decompressed, and then recompressed back to its original form. Of course, any file is also a possible bicom input that can be compressed, and then decompressed back to its original form. At the time of its release, bicom is the only competitive compressor with this quality.
If you consider compression to be an important first step before encryption, because it minimizes redundancy in the plaintext, then you can appreciate this bijective quality, because bijectivity implies that the compression adds no known plaintext to its output.
To support encryption applications, bicom also includes a passphrase-protection option that will automatically encrypt after compressing, or decrypt before decompressing. The cipher used is the current AES proposal, Rijndael. It is applied in a unique way, however, that preserves the bijective property of the compressor, and almost never changes the size of a compressed file.
The latest version of bicom is 1.01: Download
Previous Versions
1.01 is the first public release
NOTE: The bijective property implies that new versions of bicom will be incompatible with old ones, unless they contain only speed enhancements. When this is the case, it's version number will differ only in the thousandths place or later. (e.g., bicom 1.012 would be compatible with bicom 1.011 and bicom1.01, but bicom 1.02 would not).
Here are the bicom results for the calgary corpus:
File |
Original Size |
Compressed |
Ratio |
Bits/Char |
BIB |
111,261 |
26,646 |
76.05% |
1.916 |
BOOK1 |
768,771 |
236,775 |
69.20% |
2.464 |
BOOK2 |
610,856 |
157,979 |
74.14% |
2.069 |
GEO |
102,400 |
58,353 |
43.01% |
4.559 |
NEWS |
377,109 |
112,103 |
70.27% |
2.378 |
OBJ1 |
21,504 |
10,163 |
52.74% |
3.781 |
OBJ2 |
246,814 |
70,525 |
71.43% |
2.286 |
PAPER1 |
53,161 |
15,698 |
70.47% |
2.362 |
PAPER2 |
82,199 |
24,646 |
70.02% |
2.399 |
PAPER3 |
46,526 |
15,361 |
66.98% |
2.641 |
PAPER4 |
13,286 |
4,841 |
63.56% |
2.915 |
PAPER5 |
11,954 |
4,453 |
62.75% |
2.980 |
PAPER6 |
38,105 |
11,541 |
69.71% |
2.423 |
PIC |
513,216 |
54,214 |
89.44% |
0.845 |
PROGC |
39,611 |
11,696 |
70.47% |
2.362 |
PROGL |
71,646 |
13,984 |
80.48% |
1.561 |
PROGP |
49,379 |
9,491 |
80.78% |
1.538 |
TRANS |
93,695 |
15,372 |
83.59% |
1.313 |
TOTAL |
3,251,493 |
853,841 |
||
AVERAGE bits/char |
per file: |
2.377 |
||
overall: |
2.101 |
|||
Bicom uses an unbounded context-length PPM model, implemented with a sliding window suffix tree. Of the (relatively) compact representations for unbounded PPM contexts, the suffix tree is especially useful, because it is linear in the size of the input, stores exactly one copy of each unique context, and automatically performs full update exclusion. The suffix tree was originally based on Jesper Larsson's excellent sliding-window suffix tree code. None of the original code remains, but the algorithms are still there, with the exception of Jesper's very cool hashing scheme that minimizes memory requirements -- the PPM model needed linked lists to walk anyway, so the hashes became overhead.
Bicom compresses reasonably well, because it borrows the LOE (Local order estimation) and SEE (Secondary Escape Estimation) ideas from Charles Bloom's PPMZ. Bicom doesn't currently compress as well as PPMZ, though, because I didn't implement SEE faitfully at all -- I just used the general ideas. The goal, of course, is to eventually optimize escape estimation until it is superior to PPMZ's. Charles' keeps making that target more difficult (PPMZ2 is out!), so I may restart with a faithful SEE copy when PPMZ2's source is released.
Its also possible, of course, that there are logic errors (well, misimplementations, really) that affect the compression ratio, but not the correct operation, of the bicom compressor. That would be great -- every time I've fixed one of those, the compressed size of the calgary corpus has dropped by about 10K.
Bicom's bijective property derives from the use of my bijective arithmetic encoder, the workings of which are reasonably well explained in Bijective Arithmetic Encoding with Optimal End Treatment. Bicom uses a different version of the arithmetic encoder with the maximum range expanded to 2^21, and a class structure that better supports complex models. If you want to use the arithmetic encoder in your own projects, it would probably be better to download bicom.
The bijective arithmetic encoder produces an intermediate representation called a finitely odd stream -- it is an infinitely long sequence of bits, but with the last 1 bit at a position finitely far from the beginning, i.e., it is a finite sequence of 0s and 1s, followed by an infinite sequence of 0s. (It's called finitely odd, because the arithmetic encoder treats it as a binary fraction).
When a passphrase is specified, bicom encrypts (or decrypts) this intermediate representation using the currently proposed AES cipher, Rijndael, in CBC mode with a constant initial value and whitening of the first 16K. Because the finitely odd stream is infinitely long, and all bits are significant to the output, you can encrypt any part of it without affecting the bijective nature of the compressor. David Scott provided the idea of encrypting all but the final 1 bit in this stream, possibly using overlapping blocks at the end. The result is that all significant bits of the file are encrypted, with no known or biased plaintext.
The rijndael implementation was adapted from NIST's reference source.