Literary and Linguistic Computing Advance Access published online on October 6, 2009
Literary and Linguistic Computing, doi:10.1093/llc/fqp035
Mining a corpus of biographical texts using keywords
National Institute of Informatics, Japan
Correspondence: Mike Conway, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan. E-mail: mike{at}nii.ac.jp
Abstract
Using statistically derived keywords to characterize texts has become an important research method for digital humanists and corpus linguists in areas such as literary analysis and the exploration of genre difference. Keywords, and the associated concepts of keyness and key-keyness, have inspired conferences, workshops, and many and varied research papers, and they are central to several modern corpus processing tools. In this article, we present evidence that (at least for the task of biographical sentence classification) frequent words characterize texts better than keywords or key-keywords. Using the naïve Bayes learning algorithm with frequency-, keyword-, and key-keyword-based text representations to classify a corpus of biographical sentences, we found that frequent words alone yielded higher classification accuracy than either the keyword or the key-keyword representation, and that the difference was statistically significant. This result suggests that, for this task at least, frequent words characterize texts better than keywords derived using more computationally intensive methods.
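To make the contrast between the two simpler representations concrete, the following minimal sketch selects features in both ways from a toy corpus. It rests on several assumptions not stated in the abstract: keyness is scored here with the log-likelihood statistic (one common choice in corpus tools, and the abstract does not name the statistic used), the function names `frequent_words`, `keywords`, and `log_likelihood` are hypothetical, and the two token lists are invented purely for illustration. Key-keywords and the naïve Bayes classification step are not covered.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score (assumed statistic, see lead-in).

    a, b: frequency of the word in the study and reference corpora.
    c, d: total token counts of the study and reference corpora.
    """
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def frequent_words(tokens, n):
    """Frequency-based representation: the n most common tokens."""
    return [w for w, _ in Counter(tokens).most_common(n)]


def keywords(study_tokens, reference_tokens, n):
    """Keyword-based representation: the n words with the highest keyness
    among those relatively more frequent in the study corpus."""
    study = Counter(study_tokens)
    ref = Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for w, a in study.items():
        b = ref[w]
        if a / c > b / d:  # keep positive keywords only
            scored.append((log_likelihood(a, b, c, d), w))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:n]]


# Toy data, invented for illustration only.
study = "he was born in 1900 and he was educated in london".split()
reference = "the cat sat on the mat and the dog sat in the sun".split()

print(frequent_words(study, 3))
print(keywords(study, reference, 3))
```

The frequency-based list needs only the study corpus and a single counting pass, whereas the keyword list also needs a reference corpus and a statistical comparison per word, which is the extra computational cost the abstract alludes to.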