Undetectable Data Corruption in JB2/JBIG2

Destroying Our Past, One Character at a Time


1. Undetectable Data Corruption with JB2/JBIG2 Formats

"JBIG2" and its derivative "JB2" are image compression formats. Both employ a procedure that renders them unfit for any use in which you need assurance that the text you read is the text that was scanned.

These formats attempt to recognize characters on a page. Having done so, they store only one copy of each character image. Each time they recognize that character again, they simply record that the character occurred and display it using the single, common copy. So if the JBIG2/JB2 algorithm recognizes, say, an '8', it stores only the compressed image of that '8'. Each time it recognizes another '8', it throws away the actual image of it and uses, instead, the first stored '8'. Obviously, this can significantly reduce the amount of storage necessary.
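The store-once, reference-many idea described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the actual JBIG2 or JB2 codec: the tiny text bitmaps, the names `differing_pixels`, `encode`, `decode`, and the `tolerance` parameter are all hypothetical, and real encoders use more elaborate shape matching than a simple pixel count.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # toy bitmap: one string per pixel row ('#' = ink)


def differing_pixels(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def encode(page: List[Glyph], tolerance: int) -> Tuple[List[Glyph], List[int]]:
    """Return (dictionary, refs).

    A glyph that differs from an already stored glyph by at most
    `tolerance` pixels is not stored; only a reference to the stored
    glyph is recorded. (Hypothetical matching rule, for illustration.)
    """
    dictionary: List[Glyph] = []
    refs: List[int] = []
    for glyph in page:
        for index, stored in enumerate(dictionary):
            if differing_pixels(glyph, stored) <= tolerance:
                refs.append(index)  # reuse the stored image; discard this one
                break
        else:
            dictionary.append(glyph)  # first sighting: keep the image itself
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decode(dictionary: List[Glyph], refs: List[int]) -> List[Glyph]:
    """Rebuild the page from stored images and references."""
    return [dictionary[i] for i in refs]


EIGHT = ("###", "#.#", "###", "#.#", "###")
SMUDGED_EIGHT = ("###", "#.#", "###", "#.#", "##.")  # one pixel missing

page = [EIGHT, SMUDGED_EIGHT, EIGHT]
dictionary, refs = encode(page, tolerance=1)
restored = decode(dictionary, refs)
print(len(dictionary), refs)  # 1 [0, 0, 0]
```

Three glyphs on the page become one stored image plus three small references, which is where the storage saving comes from. Note that the smudged glyph's own pixels are no longer anywhere in the encoded data.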

Equally obviously - one would think - this introduces a fatal problem. If the JBIG2/JB2 software mistakenly recognizes an '8' where in fact a '6' was written, it throws away the '6', stores the "fact" that an '8' occurred, and displays it as an '8'. Unless the JBIG2/JB2 software recognizes each character perfectly, it changes the content of the material that it compresses. Moreover, it changes this content irretrievably and in a way that cannot be detected. (If a document contains a '6' that looks a bit like an '8', conventional image compression software preserves the original character so that you can examine it and decide for yourself. With JBIG2/JB2 compression, this information is thrown away.)
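The failure mode can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the real JBIG2/JB2 matching procedure: the 3x5 bitmaps, the names `differing_pixels`, `roundtrip`, and `read`, and the `tolerance` parameter are hypothetical. Here a '6' and an '8' differ by a single pixel, so a matcher that tolerates one differing pixel treats them as the same character.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # toy bitmap: one string per pixel row ('#' = ink)

EIGHT: Glyph = ("###", "#.#", "###", "#.#", "###")
SIX: Glyph = ("###", "#..", "###", "#.#", "###")  # differs from EIGHT by one pixel


def differing_pixels(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def roundtrip(page: List[Glyph], tolerance: int) -> List[Glyph]:
    """Compress with first-match substitution, then decompress.

    Each glyph is replaced by the first stored glyph that differs from it
    by at most `tolerance` pixels (a hypothetical matching rule).
    """
    stored: List[Glyph] = []
    output: List[Glyph] = []
    for glyph in page:
        match = next((s for s in stored if differing_pixels(glyph, s) <= tolerance), None)
        if match is None:
            stored.append(glyph)
            match = glyph
        output.append(match)  # the original glyph image is discarded here
    return output


NAMES = {EIGHT: "8", SIX: "6"}


def read(page: List[Glyph]) -> str:
    """Read the page back as digits."""
    return "".join(NAMES[g] for g in page)


original = [EIGHT, SIX, EIGHT, SIX]
exact = roundtrip(original, tolerance=0)  # only identical bitmaps are merged
lossy = roundtrip(original, tolerance=1)  # near matches are merged too
print(read(original), read(exact), read(lossy))  # 8686 8686 8888
```

The lossy output is a clean, sharp "8888". Nothing in it indicates that two of the digits were once something else, and the discarded '6' bitmaps cannot be recovered from it.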

This problem first came to public attention when photocopiers that used JBIG2 (photocopiers are now really combinations of a scanner and a printer) began producing copies of bank statements whose numbers differed from those in the originals.

The JBIG2 algorithm is used as a part of the encoding of the PDFs for Google Books (for all of the black-and-white text portions). The JB2 algorithm is used in all DjVu format documents.

This is a serious problem: because they substitute one character image for another, JBIG2 and JB2 are in principle unfit for use on any document. Regrettably, since in practice most of the literature of the Western world will survive only through Google Books scans, these algorithms are falsifying our history for the sake of saving a trivial amount of storage space.