Monday, January 14, 2008

Burrows-Wheeler Transform (BWT): Provenance

Given string S[1..n], its Burrows-Wheeler transform T[1..n] is defined as follows. Consider the set of cyclic suffixes of S, ie., S[i, i+1, ..., n, 1, 2,...,i-1] for all i. Sort these strings lexicographically and output the last character in the ith string in this sorted order as T[i], for each i. Both S and T are of same size, but T has great properties:
  1. T can be compressed very nicely by standard compressors, say run-length compressor,
  2. S can be recovered from T (and the index j where S[1..n] ended up in the sort order) in nearly linear time.
(1) is hard to believe but there is simple intuition because if you consider two substrings that are far apart in S, you will notice that identical characters are brought together in T. Kosaraju, Manzini and others have formally analyzed the compression properties. (2) is hard to believe and works like magic, a gem in string matching algorithm literature. Thanks to researchers like Paolo Ferragina, Giovanni Manzini and others, BWT has become a very powerful tool, something for algorithm textbooks.

Btw, BWT has an intriguing provenance. Wheeler was an eccentric creator who intuited many algorithms, results and ideas. Burrows, who narrates this story (see page 199 here) in a TCS journal issue is much too modest and undersells his own considerable contribution in developing the original ideas into concrete, efficient algorithms and applications. Together they have produced a creative piece of work.

ps: I have been a voyeur. I liked BWT since the first time I saw it in 1994, have used it, maybe even extended it, but I still approach it as an outsider, curious, willing it hold it like a crystal between my fingers, rotate it, examine it, but eventually put it back on the table.

0 Comments:

Post a Comment

<< Home