Lucene - Capitalization and Numbers

Lucene is an embeddable search engine used in many (many...) Java applications. It's extremely powerful, flexible and configurable, which makes it perfect to embed, although you have to watch closely how you use it.

Problem: how do I correctly index the following text: "upgrade: version 1.5.3 part 0001.ABCD for TRex"?

Things you have to take into account:

"TRex" might be confused with Trex, trex, t-rex, t-Rex, etc.
"1.5.3" is really "1.5.3": don't ignore the dots.
"0001.ABCD" is really "0001.abcd": normalize the capitalization, but don't ignore the dot.

Sounds easy enough: just use the "LowerCaseFilter" or, even better, the "LowerCaseTokenizer". There's a subtle problem with this approach, though. The "LowerCaseTokenizer" is in fact a "LetterTokenizer", which uses Java's Character.isLetter method to decide what belongs to a token. This means that only letters become part of tokens. Digits and punctuation do not, and worse, they cause tokens to split, which is undesirable for text like this.
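To see what goes wrong, here is a minimal plain-Java sketch that imitates the letter-only behaviour described above. It does not use Lucene at all. The class and method names (LetterSplitDemo, letterTokens) are made up for illustration, and it only assumes that a token is a run of characters for which Character.isLetter returns true.

```java
import java.util.ArrayList;
import java.util.List;

class LetterSplitDemo {
    // Imitation of letter-only tokenizing (not Lucene code):
    // a token is a maximal run of letters, lower-cased;
    // any other character ends the current token and is dropped.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [upgrade, version, part, abcd, for, trex]
        System.out.println(letterTokens("upgrade: version 1.5.3 part 0001.ABCD for TRex"));
    }
}
```

The version number "1.5.3" disappears completely and "0001.ABCD" shrinks to "abcd", so neither can be found by searching for the original string.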

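What you do want in this case is a tokenizer that splits only on whitespace (with Character.isWhitespace) and normalizes letters to a single case (with Character.toUpperCase or Character.toLowerCase; which one does not matter, as long as indexing and querying use the same).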
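The two rules can be sketched in plain Java without any Lucene classes. This is only an illustration of the logic, not a drop-in tokenizer: the names WhitespaceCaseDemo and tokenize are hypothetical, and lower case is chosen here to match the "0001.abcd" form listed earlier. In a real Lucene tokenizer the same two decisions (is this character part of a token, and how is it normalized) would go wherever your Lucene version lets you customize a character-based tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceCaseDemo {
    // Rule 1: only whitespace separates tokens.
    // Rule 2: every character is normalized to one case (lower case here).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                // Digits and punctuation pass through toLowerCase unchanged.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [upgrade:, version, 1.5.3, part, 0001.abcd, for, trex]
        System.out.println(tokenize("upgrade: version 1.5.3 part 0001.ABCD for TRex"));
    }
}
```

Note that "1.5.3" and "0001.abcd" now survive as single tokens. The trailing colon also stays attached to "upgrade:", because whitespace is the only separator. This sketch also does nothing about hyphenated spellings such as "t-rex".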

Easy, efficient, elegant, and it easily fits in with the rest of Lucene!