Lucene - Capitalization and Numbers
Lucene is an embeddable search engine used in many (many...) Java applications. It's extremely powerful, flexible and configurable, which is exactly what makes it perfect to embed. You do have to watch closely how you use it, though.
Problem: how do I correctly index the following text: "upgrade: version 1.5.3 part 0001.ABCD for TRex"?
Things you have to take into account:
- TRex might be confused with Trex, trex, t-rex, t-Rex, etc.
- 1.5.3 is really "1.5.3": don't ignore the dots.
- 0001.ABCD should also match "0001.abcd": ignore capitalization, but don't ignore the dots.
What you want in this case is a tokenizer that splits only on whitespace (with Character.isWhitespace) and normalizes letters to uppercase (with Character.toUpperCase).
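To make the idea concrete, here is a minimal sketch of that tokenization rule in plain Java. It deliberately does not use any Lucene classes; the class name WhitespaceUpperCaseTokenizer and the tokenize method are made up for this illustration. It only shows what the token stream would look like, not how to plug it into an analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration, not a Lucene class.
final class WhitespaceUpperCaseTokenizer {

    // Splits text on whitespace only and uppercases every character of each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current token (if there is one).
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Dots, colons and digits are kept; letters are normalized.
                current.append(Character.toUpperCase(c));
            }
        }
        // Flush the last token when the text does not end in whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("upgrade: version 1.5.3 part 0001.ABCD for TRex"));
        // prints [UPGRADE:, VERSION, 1.5.3, PART, 0001.ABCD, FOR, TREX]
    }
}
```

Note that the colon stays attached to "UPGRADE:" and that a hyphenated spelling such as "t-rex" would become "T-REX", not "TREX", because nothing but whitespace splits or alters a token. In Lucene itself the same two decisions would live in a custom tokenizer, and the same normalization has to be applied to query terms so that "trex" and "0001.abcd" find the indexed tokens.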
Easy, efficient, elegant, and it fits in easily with the rest of Lucene!