BICOM
BIjective COMpressor

Matt Timmermans, October 26, 2000


What is bicom?

Bicom is a data compressor in the PPM family. It is freely available and open source. It compresses well, though not as well as the best PPM-style compressors (it will get better), and slowly, though not as slowly as the best PPM compressors (and it will get faster).

Its most distinctive characteristic, however, is that compression with bicom is completely bijective -- any file is a possible bicom output that can be decompressed, and then recompressed back to its original form. Of course, any file is also a possible bicom input that can be compressed, and then decompressed back to its original form. At the time of its release, bicom is the only competitive compressor with this property.

If you consider compression to be an important first step before encryption, because it minimizes redundancy in the plaintext, then you can appreciate this bijective quality, because bijectivity implies that the compression adds no known plaintext to its output.
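To make the bijectivity claim concrete, here is a small Python sketch. Bicom's real transform is a PPM model driving a bijective arithmetic coder; move-to-front coding is used below only because it is an easily verified bijection on byte strings, so both round trips can be checked directly.

```python
# Illustrative only: move-to-front coding is a simple bijection on
# byte strings, standing in for bicom's (much stronger) transform.

def mtf_encode(data: bytes) -> bytes:
    table = list(range(256))
    out = bytearray()
    for b in data:
        i = table.index(b)       # emit the symbol's current rank
        out.append(i)
        table.pop(i)
        table.insert(0, b)       # move it to the front
    return bytes(out)

def mtf_decode(data: bytes) -> bytes:
    table = list(range(256))
    out = bytearray()
    for i in data:
        b = table[i]             # rank -> symbol
        out.append(b)
        table.pop(i)
        table.insert(0, b)       # same table evolution as the encoder
    return bytes(out)

# Bijectivity means BOTH directions round-trip:
msg = b"any byte string is a valid input AND a valid output"
assert mtf_decode(mtf_encode(msg)) == msg   # decode . encode = identity
assert mtf_encode(mtf_decode(msg)) == msg   # encode . decode = identity
```

The second assertion is the one most compressors fail: an arbitrary file is usually not a legal compressed stream, so decompress-then-recompress does not return it unchanged.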

To support encryption applications, bicom also includes a passphrase-protection option that will automatically encrypt after compressing, or decrypt before decompressing. The cipher used is the current AES proposal, Rijndael. It is applied in a unique way, however, that preserves the bijective property of the compressor, and almost never changes the size of a compressed file.

Getting bicom

The latest version of bicom is 1.01: Download

Previous Versions

1.01 is the first public release

NOTE: The bijective property implies that new versions of bicom will be incompatible with old ones, unless they contain only speed enhancements. When that is the case, the version number will differ only in the thousandths place or later (e.g., bicom 1.012 would be compatible with bicom 1.011 and bicom 1.01, but bicom 1.02 would not).

Bicom Performance

Here are the bicom results for the calgary corpus:

File        Original Size    Compressed    Ratio    Bits/Char
BIB               111,261        26,646   76.05%        1.916
BOOK1             768,771       236,775   69.20%        2.464
BOOK2             610,856       157,979   74.14%        2.069
GEO               102,400        58,353   43.01%        4.559
NEWS              377,109       112,103   70.27%        2.378
OBJ1               21,504        10,163   52.74%        3.781
OBJ2              246,814        70,525   71.43%        2.286
PAPER1             53,161        15,698   70.47%        2.362
PAPER2             82,199        24,646   70.02%        2.399
PAPER3             46,526        15,361   66.98%        2.641
PAPER4             13,286         4,841   63.56%        2.915
PAPER5             11,954         4,453   62.75%        2.980
PAPER6             38,105        11,541   69.71%        2.423
PIC               513,216        54,214   89.44%        0.845
PROGC              39,611        11,696   70.47%        2.362
PROGL              71,646        13,984   80.48%        1.561
PROGP              49,379         9,491   80.78%        1.538
TRANS              93,695        15,372   83.59%        1.313
TOTAL           3,251,493       853,841

AVERAGE bits/char -- per file: 2.377, overall: 2.101

Bicom algorithms and inspiration

The model

Bicom uses an unbounded context-length PPM model, implemented with a sliding window suffix tree. Of the (relatively) compact representations for unbounded PPM contexts, the suffix tree is especially useful, because it is linear in the size of the input, stores exactly one copy of each unique context, and automatically performs full update exclusion. The suffix tree was originally based on Jesper Larsson's excellent sliding-window suffix tree code. None of the original code remains, but the algorithms are still there, with the exception of Jesper's very cool hashing scheme that minimizes memory requirements -- the PPM model needed linked lists to walk anyway, so the hashes became overhead.
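The prediction side of an unbounded-context PPM model can be sketched briefly. Bicom's suffix tree does this in linear time and space; the class below (my own naive stand-in, not bicom's code) uses a plain dictionary of contexts instead, which is quadratic but shows the core idea: predict from the longest previously seen context, escaping to shorter contexts down to order 0 when the current one has no statistics.

```python
# Naive sketch of unbounded-context PPM prediction.  A dict of all
# contexts stands in for bicom's sliding-window suffix tree, so this
# is quadratic where the real thing is linear.

from collections import defaultdict

class NaivePPM:
    def __init__(self):
        self.history = b""
        # counts[context][symbol] = times symbol has followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def predict(self) -> dict:
        # Walk from the longest matching context down to order 0;
        # return the first context that has any statistics.
        h = self.history
        for order in range(len(h), -1, -1):
            ctx = h[len(h) - order:]
            if self.counts[ctx]:
                return dict(self.counts[ctx])
        return {}  # nothing seen yet

    def update(self, symbol: int):
        # Record the symbol under every suffix of the history,
        # including the empty (order-0) context.
        h = self.history
        for order in range(len(h) + 1):
            self.counts[h[len(h) - order:]][symbol] += 1
        self.history += bytes([symbol])

m = NaivePPM()
for b in b"abracadabra":
    m.update(b)
# The longest context with statistics is "abra", which was followed
# by 'c' earlier in the string.
assert m.predict() == {ord("c"): 1}
```

A real PPM coder would turn these counts plus an escape probability into code-space intervals for the arithmetic coder; the suffix tree also gives bicom full update exclusion for free, which this sketch does not model.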

Bicom compresses reasonably well, because it borrows the LOE (Local Order Estimation) and SEE (Secondary Escape Estimation) ideas from Charles Bloom's PPMZ. Bicom doesn't currently compress as well as PPMZ, though, because I didn't implement SEE faithfully at all -- I just used the general ideas. The goal, of course, is to eventually optimize escape estimation until it is superior to PPMZ's. Charles keeps making that target more difficult (PPMZ2 is out!), so I may restart with a faithful SEE copy when PPMZ2's source is released.

It's also possible, of course, that there are logic errors (well, misimplementations, really) that affect the compression ratio, but not the correct operation, of the bicom compressor. That would be great -- every time I've fixed one of those, the compressed size of the calgary corpus has dropped by about 10K.

Bijective Compression

Bicom's bijective property derives from the use of my bijective arithmetic encoder, the workings of which are reasonably well explained in Bijective Arithmetic Encoding with Optimal End Treatment. Bicom uses a different version of the arithmetic encoder, with the maximum range expanded to 2^21, and a class structure that better supports complex models. If you want to use the arithmetic encoder in your own projects, the version in bicom is probably the better starting point.

Bijective Encryption

The bijective arithmetic encoder produces an intermediate representation called a finitely odd stream -- it is an infinitely long sequence of bits, but with the last 1 bit at a position finitely far from the beginning, i.e., it is a finite sequence of 0s and 1s, followed by an infinite sequence of 0s. (It's called finitely odd, because the arithmetic encoder treats it as a binary fraction).
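The two views of a finitely odd stream -- a finite bit string up to its last 1, and a binary fraction -- can be shown in a few lines (the function names below are my own, not bicom's):

```python
# Sketch of the "finitely odd" representation.  A finitely odd stream
# is determined entirely by its bits up to and including the last 1,
# so stripping trailing zeros gives a canonical finite form, and the
# stream reads as a binary fraction 0.b1 b2 b3 ...

from fractions import Fraction

def canonical(bits: str) -> str:
    # "10110000..." and "1011" denote the same finitely odd stream
    return bits.rstrip("0")

def as_fraction(bits: str) -> Fraction:
    # sum of 2^-(i+1) over the positions i holding a 1 bit
    return sum((Fraction(1, 2 ** (i + 1))
                for i, b in enumerate(bits) if b == "1"),
               Fraction(0))

assert canonical("10110000") == canonical("1011") == "1011"
assert as_fraction("10110000") == as_fraction("1011") == Fraction(11, 16)
```

This is why every finitely odd stream fits in a file: write the bits up to the last 1 and stop, since the trailing zeros carry no information.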

When a passphrase is specified, bicom encrypts (or decrypts) this intermediate representation using the currently proposed AES cipher, Rijndael, in CBC mode with a constant initial value and whitening of the first 16K. Because the finitely odd stream is infinitely long, and all bits are significant to the output, you can encrypt any part of it without affecting the bijective nature of the compressor. David Scott provided the idea of encrypting all but the final 1 bit in this stream, possibly using overlapping blocks at the end. The result is that all significant bits of the file are encrypted, with no known or biased plaintext.
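Why this preserves bijectivity is easy to demonstrate with a toy version of the construction. Real bicom uses Rijndael in CBC mode; the sketch below substitutes a keyed XOR stream (derived from SHA-256, my choice, not bicom's) but keeps the essential trick: every bit except the final 1 is encrypted, so the result still ends in a 1 bit and is again a valid finitely odd stream.

```python
# Toy illustration of encrypting a finitely odd stream while leaving
# its final 1 bit untouched.  A SHA-256-based XOR keystream stands in
# for bicom's Rijndael/CBC; only the structure is the point here.

import hashlib

def keystream(key: bytes, nbits: int) -> list:
    out = []
    counter = 0
    while len(out) < nbits:
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        for byte in block:
            for i in range(8):
                out.append((byte >> i) & 1)
        counter += 1
    return out[:nbits]

def crypt(bits: str, key: bytes) -> str:
    # XOR every bit except the final 1; XOR is its own inverse, so
    # the same function both encrypts and decrypts.
    assert bits.endswith("1"), "canonical finitely odd form ends in 1"
    body, last = bits[:-1], bits[-1]
    ks = keystream(key, len(body))
    return "".join(str(int(b) ^ k) for b, k in zip(body, ks)) + last

ct = crypt("10110101", b"passphrase")
assert ct.endswith("1")                        # still finitely odd
assert crypt(ct, b"passphrase") == "10110101"  # self-inverse
```

Because the ciphertext is itself a finitely odd stream of the same length, the encrypted file decompresses (with the wrong key, to garbage; with the right key, to the original), and the overall map remains a bijection.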

The Rijndael implementation was adapted from NIST's reference source.