codeb34v3r / TextTokenizer / 0.1.0

Lucene StandardTokenizer

A grammar-based tokenizer constructed with JFlex.

As of Lucene version 3.1, this class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
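To give a feel for what word-break segmentation produces, here is a minimal sketch that uses the JDK's `java.text.BreakIterator` rather than this tokenizer's JFlex grammar. The class name `WordBreakSketch` and the method `tokenize` are hypothetical names chosen for illustration. The JDK's word-boundary rules are in the same spirit as UAX #29 but may differ in details, so treat the output as an approximation of what this tokenizer would emit, not a reference.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: approximates word-break tokenization with the JDK's
// BreakIterator. This is NOT the JFlex grammar used by this tokenizer.
class WordBreakSketch {

    // Splits text at word boundaries and keeps only the segments that
    // contain at least one letter or digit, which drops whitespace and
    // punctuation-only segments.
    static List<String> tokenize(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!"));
    }
}
```

The filter on letters and digits is a design choice of this sketch: a boundary iterator reports every segment, including spaces and punctuation, so a tokenizer built on it has to decide which segments count as tokens.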

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.