
Transformer Compressor

A high-efficiency text compressor leveraging large language models 2024-03-28

Project Files • GitHub Repo

A few months ago, while learning about information theory, I came up with a novel compression algorithm, detailed in the project’s readme. The initial spark came from realizing how insane it seemed that everyday compression algorithms were built by manually hard-coding what redundancy means in the context of the data being compressed.

Or rather, I assumed it was novel. It turns out the technique I “invented” is called prediction by partial matching; see, for example, Data Compression Using Adaptive Coding and Partial String Matching, Cleary and Witten, 1984. I have a vague recollection of attempting to show that the algorithm achieved some level of optimality by deriving an equation along the lines of \(I(x) + I(y \mid x) = I(y) + I(x \mid y)\), which also already exists and is called the chain rule for mutual information.

To my credit, the algorithm does work, achieving an impressive \(0.6197\) bits per byte on a short text sample, a \(12.91\times\) compression ratio. That said, the language model I used was unnecessarily large for a proof-of-concept, with 7 billion parameters, slowing iteration on the program down to a crawl. In August 2023, a month or so after I began my work, Fabrice Bellard released his ts_zip utility, forever sealing the fate of the project with a final blow to my ambition.
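For reference, the two figures above are equivalent: since uncompressed text costs 8 bits per byte, the ratio is just 8 divided by the achieved rate.

```python
bits_per_byte = 0.6197
ratio = 8 / bits_per_byte  # uncompressed bytes cost 8 bits each
print(f"{ratio:.2f}x compression")  # → 12.91x compression
```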