▲Writing a competitive BZip2 encoder in Ada from scratch in a few days – part 2gautiersblog.blogspot.com

104 points by ajdude 4 days ago | 7 comments

nayuki 13 hours ago [-]

> Another surprising feature of the BWT is that for reversing the permutation, no extra information is needed!

I think this is not true. The way I learned the BWT, after encoding, you need to store the index of the first character (which is a tiny bit of extra information). https://web.archive.org/web/20170325024404/http://marknelson...

amy_petrik 4 hours ago [-]

well here's a big surprise of BWT

BWT is a transformation transforming a string into "suffix tree space"

You can do searches - very fast searches - in BZ2/BWT compression space

Discussed in the '90s in the legendary dr. dobbs

But more significantly, BWT is the basis for most DNA and RNA sequence aligners nowadays - which perform string search/matching in suffix tree (transformed) space. It's like doing frequency stuff in fourier domain, it's like adding is multiplication in logspace - but in weird data string search space. The BWT.

horizion2025 10 hours ago [-]

I also think when looking at how BWT works, it is initially surprising it is invertible even with such little information as the first character. It appears a bit like sorting the letters and that certainly would require a lot more information to back into place. It is a bit magic even when you know the inversion algorithm.

vintermann 12 hours ago [-]

There is a fully bijective version of the BWT which doesn't require the index. It's not what BZip2 uses though.

Someone 12 hours ago [-]

That method adds an extra character to your alphabet, and uses that as a marker in the transformed string to h effectively) store the index of the first character inside the transformed text.

If, as is often the case, your alphabet contains exactly as many letters as a byte/word can store, that makes for awkward encoding. I also think that, typically, that method will require more memory.

vintermann 11 hours ago [-]

No, I'm not talking about that. I'm talking about David Scott's bijective BWT variant. It sorts symbols by a slightly different type of context, and so can do away with a sentinel symbol or a stored index entirely. It was described in the wikipedia page last I checked.

Scott was a bit of a weirdo, but his algorithm actually got recognition in academia, there have been many papers written about it. One character saved doesn't make a lot of difference for compression, but the transform being a permutation does give it some nice properties.

etrez 2 hours ago [-]

Thx, corrected.

Loading comments...