Bijective Arithmetic Encoding with Optimal End Treatment

*Matt Timmermans, September 1999*

This page describes the operation of the bijective arithmetic
compressor. This compressor has the nifty property that *any
file at all* can be *decompressed* and then compressed
back to its original form, in addition to the normal property
that any file can be *compressed* and then decompressed
back to its original form.

One of the techniques used to accomplish this is a better way of indicating the end of the compressed stream.

My simple bijective arithmetic compressor, with commented source, is here: biacode.zip

David A. Scott's bijective Huffman compressor (the first bijective compressor known) can be found on David's Compression Page

The rest of this page describes the operation of the bijective arithmetic compressor in my own strange way -- it starts from the beginning and requires no prior knowledge of compression techniques, but it goes way too fast:

- Entropy Coding
- Arithmetic Coding
- End Treatment
- End Treatment for Whole-Bit Encodings
- Bijective Arithmetic Compression

We begin with a source of symbols, drawn from a known alphabet, and we are given the task of writing down the symbols as the source produces them.

We are, in reality, a computer program, and so
"writing" consists, for us, of producing a sequence of
bits. We are also a *compression* program, and so our user
expects us to write these symbols using as few bits as possible.

Let us say that our alphabet consists of the 8 symbols ABCDEFGH. Each time we get a symbol from the source, then, there are 8 possibilities.

Now, a bit, like the ones we must write, is a unit of
information. A bit can be either 0 or 1 -- there are two
possibilities. Writing a bit, therefore, conveys exactly enough
information to *distinguish* between two possibilities.
This is not enough information to uniquely identify every symbol
in our alphabet, and so it is clear that we will (at least
sometimes) be required to write more than one bit for each symbol
of input.

We are left considerable leeway in how to do this, however.

Given an arbitrary symbol *x*, for example, we may
adopt the following strategy:

- If *x* = A or B, write 0, then:
    - If *x* = A, write 0
    - If *x* = B, write 1
- If *x* = C, D, E, F, G, or H, write 1, then:
    - If *x* = C or D, write 0, then:
        - If *x* = C, write 0
        - If *x* = D, write 1
    - If *x* = E, F, G, or H, write 1, then:
        - If *x* = E or F, write 0, then:
            - If *x* = E, write 0
            - If *x* = F, write 1
        - If *x* = G or H, write 1, then:
            - If *x* = G, write 0
            - If *x* = H, write 1

As you can see, since a bit provides exactly enough
information to distinguish between two possibilities, we can use
each bit we write to distinguish between two different *groups*
of symbols. If the group we identify contains more than one
symbol, we must divide it into two smaller groups and write
another bit, etc., until a single symbol is identified.

The sequence of bits we write for each symbol is that symbol's "code". Let's have a look at these:

| Symbol | Code | Code Length |
|--------|------|-------------|
| A | 00 | 2 |
| B | 01 | 2 |
| C | 100 | 3 |
| D | 101 | 3 |
| E | 1100 | 4 |
| F | 1101 | 4 |
| G | 1110 | 4 |
| H | 1111 | 4 |
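As a quick sketch (in Python, purely for illustration), the whole decision tree boils down to a lookup table in each direction. Because no code is a prefix of any other, the decoder can recognize each code the moment its last bit arrives:

```python
# The code table above, as a simple lookup (illustrative sketch only).
CODES = {"A": "00", "B": "01", "C": "100", "D": "101",
         "E": "1100", "F": "1101", "G": "1110", "H": "1111"}

def encode(symbols):
    """Concatenate each symbol's code."""
    return "".join(CODES[s] for s in symbols)

def decode(bits):
    """Accumulate bits until they exactly match one code.
    This is unambiguous because the code is prefix-free."""
    inverse = {code: sym for sym, code in CODES.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```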

Before we can ask ourselves just how efficient this method of encoding is, relative to the other possibilities, we have to figure out just how much freedom we have in choosing encodings. Clearly, by changing the way we divide the symbols into groups, we can change how many bits we write for each symbol. It is required, however, that each symbol can be uniquely identified in the end. Given an N symbol alphabet, with symbols s_1 through s_N, this implies that:

2^(-L_1) + 2^(-L_2) + ... + 2^(-L_N) = 1

Where L_i is the code length of symbol s_i.

With our example, this expands to 2*1/4 + 2*1/8 +4*1/16 = 1/2 + 1/4 + 1/4 = 1, and we can see that it is true.
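The same check can be done mechanically (a quick sketch using exact fractions, so nothing is lost to rounding):

```python
from fractions import Fraction

# Code lengths for A..H from the table above.
lengths = [2, 2, 3, 3, 4, 4, 4, 4]

# The sum of 2^-L over all symbols must come to exactly 1.
kraft_sum = sum(Fraction(1, 2 ** L) for L in lengths)
print(kraft_sum)  # 1
```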

So how efficient is this encoding? It depends on how common
each of the symbols is, or to put it another way, for each
possible symbol, how likely is it that a symbol from the source *is*
that symbol?

If all symbols are equally likely, then our average code length is (4*4 + 2*3 + 2*2)/8 = 26/8 = 3.25 bits. If each symbol is equally likely, then, our encoding is certainly not the best we can do, because an encoding exists that uses only 3 bits per symbol (8*1/8 = 1).

But what if A's and B's are more common than the other
symbols? Our encoding uses 2 bits to encode A or B, but spends 4
bits to encode E, F, G, or H. If a symbol is more likely to be A
or B than it is to be E, F, G, or H, however, then we will write
two bits for a symbol more often than we write 4, and our average
code length will be *less* than 3 bits per symbol.

So the efficiency of a particular encoding depends on the
symbol probabilities. It is possible, in fact, to create an *optimal*
encoding (the very best you can do) from the probabilities
themselves, and the mapping is trivial.

If we assign each symbol s_i a probability P_i, then the optimal encoding has L_i = -log2(P_i) for every symbol. Since every symbol must be one of the symbols from our alphabet, the sum of all the symbol probabilities must be 1, and so the optimal encoding satisfies the constraint on the code lengths.

Returning to our example, we see that our encoding is optimal when the probabilities of A,B,C,D,E,F,G,H are 1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16. If we assume these probabilities and calculate our average code length (properly, weighting a symbol's length by its probability of occurring), we find that our average code length is 1/2*2 + 1/4*3 + 1/4*4 = 1 + 3/4 + 1 = 2.75 bits per symbol -- much better than 3.
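Both averages are easy to verify (a quick sketch; the probabilities are the ones just assumed):

```python
from fractions import Fraction

lengths = {"A": 2, "B": 2, "C": 3, "D": 3, "E": 4, "F": 4, "G": 4, "H": 4}
# The probabilities that make this encoding optimal: P = 2^-L.
probs = {s: Fraction(1, 2 ** L) for s, L in lengths.items()}

# Average code length, weighting each symbol's length by its probability.
avg = sum(p * lengths[s] for s, p in probs.items())
print(avg)  # 11/4, i.e. 2.75 bits per symbol

# The uniform case for comparison: every symbol equally likely.
uniform_avg = Fraction(sum(lengths.values()), len(lengths))
print(uniform_avg)  # 13/4, i.e. 3.25 bits per symbol
```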

Given that distribution of symbol probabilities, 2.75 bits per
symbol (on average) is the very best you can do. In Shannon's
information theory, 2.75 bits/symbol is the *entropy* of
the source, and so the method of encoding that assigns optimal
code lengths according to symbol probabilities is referred to as *entropy
encoding*.

So far so good -- if we want to encode symbols from a source optimally, we just assign each symbol a code with length L = -log2(P), where P is the probability of the symbol's occurrence.

But wait! This only really works if P is a negative power of 2. Otherwise -log2(P) will not be an integer, so the optimal code length will not be an integer. An optimal encoding would then have to write a non-integral number of bits!

The good news is that, though it seems impossible to write, say, 3.35 bits to encode a particular symbol, we can actually do it if we have lots of symbols to write.

The very clever trick that makes this possible is this:

- Rather than writing bits for each symbol, we write a single number, in the range [0,1) that encodes the entire source.
- We are still a computer program, though, so we will write this number as a sequence of bits. Specifically, if we write it in binary with a decimal point, a number in the interval [0,1) will be of the form 0.###########...., where each # is either a 0 or a 1. We will encode our number by writing out the bits to the right of the decimal point.
- To encode a symbol, rather than writing bits, we will simply make a decision that reduces the range of numbers we might finally write.

To see how this works, let's examine our situation after we've seen all the symbols in the source, and made all of these "decisions that reduce the range of numbers we might write". We will be left having to write a number in some interval [x,x+R), where 0 <= x < x+R <= 1. R is the final size of this range.

The question is, then, "how many bits do we have to write before we get a number in this range?" The answer is always less than -log2(R) + 2 bits. Since a smaller final range means writing more bits, each of the range-reducing decisions we must make increases the number of bits we must write, but by how much?

If we begin with a range [x,x+R) and encode a symbol by deciding that we will finally write a number in the range [x+lR,x+(l+P)R), where 0 <= l < l+P <= 1 and so x <= x+lR < x+(l+P)R <= x+R, then we increase the number of bits we must write from about -log2(R) to about -log2(PR). The difference is:

-log2(PR) - (-log2(R)) = -log2(P)

Oh, how marvellous!

If we have an alphabet of N symbols, with probabilities P_1 through P_N, and we have a current interval [x,x+R), we can divide the interval into N disjoint sub-intervals, one for each possible symbol s_i, and we can set the size of each symbol's sub-interval to P_i*R. To encode a symbol, then, we simply decide that we will finally write a number in that symbol's sub-interval. We replace the current interval with the sub-interval corresponding to the appropriate symbol s_i, which will result, in the end, in writing an optimal -log2(P_i) bits for that symbol -- even if -log2(P_i) is not an integer. Because the sub-intervals for possible symbols are disjoint, the person or program that will finally read the number we write can tell which symbol we encoded.
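A toy version of this interval-narrowing, using exact fractions rather than the scaled integers of a real implementation (the probabilities are from the running example; the sub-interval layout is one arbitrary but fixed ordering):

```python
from fractions import Fraction
from math import log2

# Each symbol owns a disjoint sub-interval of size P*R, laid out in order.
PROBS = {"A": Fraction(1, 4), "B": Fraction(1, 4),
         "C": Fraction(1, 8), "D": Fraction(1, 8),
         "E": Fraction(1, 16), "F": Fraction(1, 16),
         "G": Fraction(1, 16), "H": Fraction(1, 16)}

def narrow(x, R, symbol):
    """Replace [x, x+R) with the symbol's sub-interval [x+l*R, x+(l+P)*R)."""
    l = Fraction(0)
    for s, p in PROBS.items():
        if s == symbol:
            return x + l * R, p * R
        l += p
    raise ValueError(symbol)

x, R = Fraction(0), Fraction(1)
for s in "CAB":
    x, R = narrow(x, R, s)

# The final range size is the product of the probabilities, so
# -log2(R) is exactly the sum of the symbols' optimal code lengths.
print(-log2(R))  # 3 + 2 + 2 = 7.0 bits
```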

Real implementations of arithmetic encoding do not do all this math using very long binary numbers. Instead, numbers x and R that form the current interval [x,x+R) are stored as integers, scaled up by some power of 2 that keeps the interval size, R, in a workable range (typically (32768,65536] or so). Probabilities are stored as integers of slightly smaller size.

When the interval size decreases past its lower bound, the scaling power increases, and the integers x and R are doubled, until R returns to an appropriate size.

x, the lower bound on the interval, does not decrease as R does, and so repeated doubling would cause the integer to overflow. To prevent this, implementations write out the more significant bits of x as it increases.

No matter how big you let x get before you write out its
higher order bits, the possibility exists that the bits you *do*
write out might need to change due to carry propagation. When your
interval's lower bound ends in a long run of 1 bits -- say,
x = 0.0111...1 with 1000 consecutive ones -- it is still possible
to finally write 0.1000...0. This causes 1000 bits of x to change.
Various implementations have different
ways of dealing with this. My arithmetic encoder doesn't output 1
bits, or the most recent 0 bit, immediately. Instead, it simply
counts the number of 1's (say M) produced after the last zero.
Upon producing another 0 bit, then, it will output a 0 followed
by M ones. Upon receiving a 2 bit (i.e., a carry propagation), it
will output a 1 followed by M zeros.
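The counting scheme just described can be sketched as a small bit sink (my own illustration, not the actual biacode source; a 2 stands for an incoming carry, as in the text):

```python
class BitOutput:
    """Delays the most recent 0 bit and counts the 1s after it, so a
    carry (a '2') can be absorbed before anything is committed."""

    def __init__(self):
        self.out = []          # bits committed so far
        self.pending = False   # do we hold a delayed 0 bit?
        self.ones = 0          # number of 1s counted since that 0

    def put(self, bit):
        if bit == 0:
            # Commit the delayed 0 (if any) and the M ones after it,
            # then start delaying this new 0.
            self.out += ([0] if self.pending else []) + [1] * self.ones
            self.pending, self.ones = True, 0
        elif bit == 1:
            self.ones += 1     # just count; don't commit anything yet
        else:
            # bit == 2: a carry arrived.  The delayed 0 becomes a 1,
            # and the M ones after it become zeros.
            self.out += [1] + [0] * self.ones
            self.pending, self.ones = False, 0

    def done(self):
        """Flush whatever is still delayed and return all bits."""
        self.out += ([0] if self.pending else []) + [1] * self.ones
        self.pending, self.ones = False, 0
        return self.out
```

For example, feeding in 0, 1, 1 and then a carry commits 1, 0, 0 -- the delayed 0111 pattern flipped to 1000 without any committed bit ever changing.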

Arithmetic encoding, as presented so far, works just fine when we're writing an endless stream of symbols from some source to some sink. In real systems, though, this is often not the case -- we are typically compressing an input file of finite length, and we expect to be able to decompress this to recreate the same file. To make this possible, we need to tell the decompressor how many symbols we write.

There are two conventional approaches to this:

- The first is simply to write out the length at the beginning of the compressed stream. This consumes O(log(N)) bits, where N is the length of the input in symbols;
- The second is to reserve a special EOF symbol that
doesn't occur in the input. When we're finished encoding
symbols from our source, we just write the EOF symbol at
the end, which tells the decoder to stop. Because we must
assign a probability to this symbol whenever we
*might* stop encoding, however, this makes the encoding of all other symbols slightly less efficient, and wastes O(N) bits altogether. Of course, the EOF symbol is usually given a very small probability, so the wasted space doesn't add up to much.

Both of these methods are only slightly wasteful when the input is of any significant size. Still, the rest of the arithmetic encoding process is so nearly perfect that we find this blatant waste of bits to be annoying, and we present a better way.

Recall that when we're finished encoding, we have to write a
number in some interval [x,x+R), and we know that we have to
write about -log2(R) bits to do this. *Exactly* how many bits it
takes, however, depends on the bits we *don't* write.

Conventional arithmetic encoders will accept an input of finite length and write out a number with finite precision. By specifying enough bits (about -log2(R) + 2), the encoder can make sure that the number it writes is in the required range no matter what the decoder thinks about the bits that follow -- if the encoder were to continue writing any number of 1's and zeros, the number would still be in the proper range.

We can do slightly better if we assume that unwritten bits are zero. An arithmetic encoder that adopts this approach will accept an input of finite length and write out a number of infinite precision that is "finitely odd". "Finitely odd" means that the right-most 1 bit is finitely far from the beginning of the output or, equivalently, that the number it writes is divisible by some negative power of 2.

In the "binary decimal" representation that an arithmetic encoder writes, a finitely odd number is either:

- an infinite string of 0s; or
- a finite number of 0s and 1s, followed by the last 1, followed by an infinite string of zeros.

In any case, when the arithmetic encoder writes a finitely odd number, it simply omits the infinite zero tail, and the decoder assumes that unwritten bits are all zero.
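Computationally, the "most even" number in an interval -- the one divisible by the largest negative power of 2, and hence the one with the shortest expansion -- is easy to find. A sketch with exact fractions (purely illustrative):

```python
from fractions import Fraction
from math import ceil

def most_even(lo, hi):
    """Return the number with the shortest binary expansion in [lo, hi):
    scan multiples of 1, 1/2, 1/4, ... until one lands in the interval."""
    n = 0
    while True:
        step = Fraction(1, 2 ** n)
        candidate = ceil(lo / step) * step   # smallest multiple of step >= lo
        if candidate < hi:
            return candidate
        n += 1

# The most even number in [1/3, 2/3) is 1/2, written as the single bit "1".
print(most_even(Fraction(1, 3), Fraction(2, 3)))  # 1/2
```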

When we're finished encoding, then, and we must write a number
in [x,x+R), we write some finitely odd number in that range, but
we write it with infinite precision. There are an infinite number
of such numbers in *any* range, and we could encode any
one of them. We can achieve better end handling by using the
number we *do* write to tell the decoder where the end of
file is. The procedure is as follows:

During our encoding, we will eventually arrive at the first
place that the input might end, and we know that whether it ends
or not, we will write a number in the current range [x1,x1+R1). We simply decide that *if*
the input *does* end here, we will write out the *most
even* number in that range, resulting in the shortest
possible output that identifies that range. Let's call this
number E1. On
the other hand, if the input *does not* end here, we
simply continue encoding normally.

Eventually, we will arrive at the second place that the input
might end, and we know that whether it ends or not, we must write
a number in the new current range [x2,x2+R2).
Now we decide that if the input *does* end here, we will
write out the *most even* number in the range that is *not* E1
(we can't write E1, of course,
because that would mean that the file ends at the previous
possible position). This number will be E2. Otherwise we continue on.

Eventually, the input will actually end, and we will write out
the most even number in the final range that does *not*
indicate a previous end.

The complete algorithm does this:

- Whenever the input *might* end, but does not, we reserve the most even number in the current range that is not already reserved, and then encode the symbol.
- When we encode a symbol, we reduce the range of numbers we might write, and so we can forget about all the reserved numbers that are *not* in the new range.
- When the input *does* end, we write out the most even number in the valid range that is *not* already reserved.
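The reservation step can be sketched by extending the "most even" search to skip numbers already spoken for (a toy illustration with exact fractions; a real coder, including biacode, tracks this far more cheaply than with a set):

```python
from fractions import Fraction
from math import ceil

def most_even_excluding(lo, hi, reserved):
    """The number with the shortest binary expansion in [lo, hi) that is
    not in `reserved`.  One always exists, since only finitely many
    numbers are ever reserved."""
    n = 0
    while True:
        step = Fraction(1, 2 ** n)
        c = ceil(lo / step) * step           # smallest multiple of step >= lo
        while c < hi:
            if c not in reserved:
                return c
            c += step
        n += 1

# First possible end: the range is still [0, 1), so we reserve 0.
reserved = {Fraction(0)}
# Second possible end: suppose the range has shrunk to [0, 1/2).
# 0 is taken and 1/2 is out of range, so the next-most-even choice is 1/4.
e2 = most_even_excluding(Fraction(0), Fraction(1, 2), reserved)
print(e2)  # 1/4
```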

Indicating the end of file in this way does not waste any bits
at all, and has another interesting property. If every valid
input is the *prefix* of a *longer* valid input,
then any finitely odd number at all will *decompress* to a
valid input, and that input will *compress* back to the
very same finitely odd number, i.e., an arithmetic compressor
like this is a *bijection* between its inputs and the
finitely odd numbers.

Before going on to the next section, let's return to whole-bit encodings, which we discussed first in the entropy coding section above. These are called "prefix-free" codes, because no symbol's bit-code is the prefix of any other's. Every such encoding, including the popular Huffman encoding, is a special case of arithmetic encoding that limits the sizes of the ranges to negative powers of two, and so the same end treatment we use for arithmetic encoding works for these types of codes as well.

For prefix-free codes, however, the situation is much simpler than with arithmetic encoding in general, because there can be at most two end numbers reserved at any time:

- At the first place where the input might end, there are no ends reserved, so we reserve 000... (after the bits we've already written out) to indicate an end at this position.
- At the next place where the file might end, 000... will still be reserved if we wrote only zeros since the last possible ending position. In this case we reserve 100... for this position; otherwise we reserve 000... again.
- At the next ending position, 000.... will still be
reserved iff:
- 000... was reserved before, and we wrote only zeros to get to the next ending position, or
- 100... was reserved before, and we wrote a 1 followed by zeros to get to the next ending position.

In any case, by the time we get to a new ending position, we will have written at least one whole bit, and so 100... will certainly not be reserved. At any end position, then, if 000... is reserved, then 100... is always available for use.

Our compressor so far is a bijection from its valid inputs to the finitely odd numbers. Unfortunately, however, real compressors don't write finitely odd numbers -- they write files, where a file is a finite sequence of bytes, each consisting of 8 bits.

Even though our end handling is very efficient, we'd like to
go one small step further and make our compressor a bijection
from its inputs to *files*. This is where the bijective
property becomes interesting, because then any file at all can be
*decompressed*, even if it wasn't compressed to begin
with. The decompressed file can then be compressed again to
produce exactly the same file we started with.

To accomplish this, we use a trivial prefix-free encoding from files to finitely odd numbers. This is a prefix-free code that simply maps each 8-bit byte to itself, but we apply the whole-bit end treatment to make this encoding into a trivial bijection from files to finitely odd numbers. More importantly, it makes the trivial decoder into a bijection from finitely odd numbers to files.

To make a bijective arithmetic compressor then, we just need to do TrivialDecode(ArithmeticEncode(input)). The arithmetic encoder translates its input into a finitely odd number, and the trivial decoder translates this into a finite sequence of bytes. The corresponding bijective arithmetic decompressor is simply ArithmeticDecode(TrivialEncode(input)).