Supported Encodings: GB2312, Hz, Big5, UTF8, ASCII (plus "OTHER" for unrecognized encodings)
Note: To save on server time and space, only the first 100 lines from the file or web document will be used in the guess.
A new Java version of this tool is now available.
One of the problems in Chinese computing is the variety of internal encodings that can be used to represent Chinese characters. The most common of these are Big5 (used in Taiwan and Hong Kong), GB2312-1980 (the National Standard of the People's Republic of China), and Hz (an e-mail-safe variant of GB). Another scheme, which I personally hope gains in popularity, is Unicode, which encodes about 21,000 simplified and traditional characters. UTF-8 is a variable-length encoding of Unicode that is useful on existing systems that do not yet support the UCS-2 form of Unicode. For more information about Chinese (and Japanese and Korean) encoding systems, I recommend CJK.INF by Ken Lunde of Adobe Systems.
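To make the difference between these encodings concrete, here is a small illustration (not part of the tool itself) that encodes the same character under each scheme. It is written in Python and relies only on the codecs in Python's standard library; the choice of sample character and the variable names are mine.

```python
# Encode a single Chinese character under each scheme and show the raw bytes.
# Uses only Python's built-in codecs; this is an illustration, not the tool's code.
ch = "中"

schemes = [
    ("GB2312", "gb2312"),
    ("Hz", "hz"),
    ("Big5", "big5"),
    ("UTF8", "utf-8"),
    # For characters in the Basic Multilingual Plane, big-endian UTF-16
    # produces the same two bytes as UCS-2.
    ("UCS-2", "utf-16-be"),
]

encoded = {name: ch.encode(codec) for name, codec in schemes}

for name, raw in encoded.items():
    # Print the bytes in hex alongside their Python repr.
    print(f"{name:7s} {raw.hex(' '):12s} {raw!r}")
```

The same character comes out as a different byte sequence in every scheme, and Hz is the only one that stays entirely within 7-bit ASCII, which is why it survives e-mail transport.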
This web application uses several heuristics to determine which encoding system a given document most likely uses. It does this in two stages. First, it checks whether the characters in the document fit the code ranges for each encoding system. Then it checks the characters in the document against a frequency table for each encoding. The encoding system that scores highest on this ranking is the guess shown in the results. If no encoding appears likely, the application returns "OTHER".
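The two stages can be sketched roughly as follows. This is a minimal Python sketch of the general idea, not the Perl code the application actually runs: the function names, the ten-character "frequency table", the simple hit-ratio scoring, and the shortcut checks for ASCII, Hz and UTF8 are all assumptions made for illustration. A real table would hold many more characters with weights.

```python
# Hypothetical mini "frequency table": a few very common characters that are
# written the same way in simplified and traditional Chinese.
COMMON = "的一是不了人我在有中"

# Stage 1 data: lead-byte and trail-byte ranges for the double-byte encodings.
RANGES = {
    "GB2312": (lambda b: 0xA1 <= b <= 0xF7,
               lambda b: 0xA1 <= b <= 0xFE),
    "Big5": (lambda b: 0xA1 <= b <= 0xF9,
             lambda b: 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE),
}
CODEC = {"GB2312": "gb2312", "Big5": "big5"}


def split_pairs(data, lead_ok, trail_ok):
    """Stage 1: return the two-byte characters, or None if the bytes do not fit."""
    pairs, i = [], 0
    while i < len(data):
        if data[i] < 0x80:          # plain ASCII byte, skip it
            i += 1
            continue
        if i + 1 >= len(data) or not lead_ok(data[i]) or not trail_ok(data[i + 1]):
            return None             # outside the code range for this encoding
        pairs.append(data[i:i + 2])
        i += 2
    return pairs


def guess_encoding(data: bytes) -> str:
    # Pure 7-bit input is either Hz (if it contains the "~{" escape) or ASCII.
    if all(b < 0x80 for b in data):
        return "Hz" if b"~{" in data else "ASCII"
    # Assumption: input that decodes cleanly as UTF-8 is taken to be UTF8.
    try:
        data.decode("utf-8")
        return "UTF8"
    except UnicodeDecodeError:
        pass
    best, best_score = "OTHER", 0.0
    for name, (lead_ok, trail_ok) in RANGES.items():
        pairs = split_pairs(data, lead_ok, trail_ok)
        if not pairs:
            continue                # failed stage 1
        # Stage 2: fraction of characters found in the frequency table.
        table = {ch.encode(CODEC[name]) for ch in COMMON}
        score = sum(p in table for p in pairs) / len(pairs)
        if score > best_score:
            best, best_score = name, score
    return best
```

For example, guess_encoding("我是中国人".encode("gb2312")) picks GB2312 in this sketch: the bytes fit the code ranges of both GB2312 and Big5, so stage 1 alone cannot decide, and it is the frequency check in stage 2 that separates them.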
I've taken the Perl code used by this application and put it into a separate file (someday to become a Perl module, when I learn how to do that). You can download this code (Perl5) and use it in your own programs. I'd like to add other encoding schemes (JIS, CNS, KSC) as I learn more about them.
I'm interested in hearing your ideas and suggestions for this tool. Please visit my contact page to submit your comments. If you came to this page directly, you might also want to take a look at some of my other on-line Chinese tools.