The segmenter requires Perl to run. Perl is free and easy to download.
I have also made available a Java version of the segmenter that works with Big5, GB, and UTF-8 encoded text files.
Usage: java -jar segmenter.jar [-b|-g|-8] inputfile.txt
-b Big5, -g GB2312, -8 UTF-8
Segmented text will be saved to inputfile.txt.seg
Words can be added to or deleted from the lexicon file directly. The segmenter has algorithms for grouping together the characters in a name, especially for Chinese and Western names, but Japanese and Southeast Asian names may not be handled well yet.
The segmentation process is also a good time to identify interesting "entities" in the text. These could include dates, times, person names, locations, money amounts, organization names, and percentages. This collection of interesting nouns is often referred to as "named entities", and the process of identifying them as "named entity extraction". The segmenter already has code to identify person names and number amounts, and I will be adding code to find the rest in the future.
The segmenter works with a version of the maximal matching algorithm. When looking for words, it attempts to match the longest word possible. This simple algorithm is surprisingly effective, given a large and diverse lexicon, but there also need to be ways of dealing with ambiguous word divisions, unknown proper names, and other words not in the lexicon. I currently have algorithms for finding names, and am researching ways to better handle ambiguous word boundaries and unknown words. One useful piece of additional knowledge would be a list of characters marked as bound or unbound, so that any segmentation leaving a bound character by itself could be disallowed. A statistical way of choosing among ambiguous segmentations would also be useful.
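As an illustration of the idea only, and not the segmenter's actual code, here is a minimal Python sketch of forward maximal matching together with a bound-character check. The function names `segment` and `leaves_bound_alone`, the tiny lexicon, and the choice of which character to treat as bound are all assumptions made for this example.

```python
def segment(text, lexicon, max_len=None):
    """Greedy forward maximal matching.

    At each position, take the longest string found in the lexicon.
    A single character is always accepted as a fallback, so text
    that is not covered by the lexicon still gets segmented.
    """
    if max_len is None:
        # Longest word in the lexicon bounds how far ahead we look.
        max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


def leaves_bound_alone(words, bound_chars):
    """Return True if any bound character ends up as a word by itself."""
    return any(len(w) == 1 and w in bound_chars for w in words)


# Hypothetical lexicon, chosen to show an ambiguous division.
lexicon = {"研究", "研究生", "生命", "起源"}
result = segment("研究生命起源", lexicon)
print(result)

# Treating 命 as bound here is purely for illustration.
print(leaves_bound_alone(result, {"命"}))
```

With this lexicon the greedy match takes 研究生 first and strands 命 as a single character, although 研究 / 生命 / 起源 is the more natural reading. This is the kind of ambiguous division described above: a bound-character check could reject the greedy result, and a statistical model could then choose among the remaining alternatives.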
More information on segmenting Chinese text can be found at ChineseComputing.com.
Contact Erik Peterson at this contact page with questions or comments. Please visit Online Chinese Tools for many more useful Chinese-related software tools.