© 1996 by Association for Literary & Linguistic Computing
An upper bound estimate for the entropy of Korean texts
The University of Suwon, Korea; Korea Research and Development Information Centre
Corresponding author at: Computer Science Department, Suwon University, Suwon PO Box NN-N8 440-600, Korea. Email: yshan@csking.kaist.ac.kr
The entropy of a printed language indicates how predictable its usage is and how efficiently its texts can be handled in text processing. In this paper we present, for the first time, an upper bound estimate of the entropy of printed Korean: 6.01 bits per Korean syllable. The estimate is computed with a stochastic language model for Korean, designed to exploit the structure of the language, whose probabilistic parameters are estimated from a sample of 5.5 million word-phrases. The entropy estimate is obtained by running the model on a sample of 1.45 million units, carefully arranged to represent a wide range of printed Korean styles.
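The general idea of such an estimate can be illustrated with a small sketch. The cross-entropy of any proper probabilistic model, measured on text the model was not trained on, is in expectation no smaller than the true entropy of the source, so it serves as an upper bound. The code below is not the stochastic model described in this paper: it assumes a plain syllable-bigram model with add-alpha smoothing, and the names `train_bigram`, `cross_entropy_bits`, `UNK` and `alpha` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder class for symbols unseen in training


def train_bigram(train_text, alpha=1.0):
    """Collect bigram and context counts over the symbols of train_text.

    Assumption: each character of the string is one unit (for Hangul text,
    one precomposed syllable block). The paper's own model is more elaborate.
    """
    vocab = set(train_text)
    bigrams = Counter(zip(train_text, train_text[1:]))
    contexts = Counter(train_text[:-1])
    return vocab, bigrams, contexts, alpha


def cross_entropy_bits(model, test_text):
    """Average -log2 P(symbol | previous symbol) over test_text, in bits."""
    vocab, bigrams, contexts, alpha = model
    v = len(vocab) + 1  # known symbols plus the single UNK class
    # Map unseen symbols to UNK so the model stays a proper distribution.
    mapped = [s if s in vocab else UNK for s in test_text]
    total_bits = 0.0
    n = 0
    for prev, cur in zip(mapped, mapped[1:]):
        # Add-alpha smoothed conditional probability; never zero.
        p = (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * v)
        total_bits -= math.log2(p)
        n += 1
    if n == 0:
        raise ValueError("test_text must contain at least two symbols")
    return total_bits / n
```

A call such as `cross_entropy_bits(train_bigram(training_sample), held_out_sample)` returns bits per symbol. Because unseen symbols are merged into one class, the figure strictly bounds the entropy of the mapped text; a model aiming at a bound per syllable would also have to spend probability on spelling out each unseen syllable.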