© 2001 by Association for Literary & Linguistic Computing
Statistical Stylistics and Authorship Attribution: an Empirical Investigation
New York University USA
David L. Hoover, Department of English, New York University, 726 Broadway, 7th Floor, New York, NY 10003, USA. E-mail: david.hoover@nyu.edu
This paper investigates the effectiveness and accuracy of multivariate analysis, specifically cluster analysis, of the frequencies of very frequent words in distinguishing texts by different authors and grouping texts by a single author. An examination of groups of texts by known authors shows that cluster analyses typically achieve an accuracy rate of less than 90 per cent for contemporary novels, modern British and American novels, and contemporary literary critical texts, both on relatively large groups of texts and on smaller subsets of those texts. Although limiting the analysis to third-person narration and disambiguating homographic function words improve the results, inaccuracies remain. Furthermore, small groups of problematic texts extracted from the larger groups in simulated authorship studies also fail to cluster correctly. These failures suggest general rather than local problems with the technique, and cast doubt on the effectiveness of cluster analysis for authorship attribution and stylistic study.
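The general technique described above can be illustrated with a minimal sketch. It is not the procedure used in this study: the four toy texts, the cutoff of five words, the Euclidean distance, and the average-linkage rule are all assumptions chosen for illustration, and the helper names (`frequency_profiles`, `cluster`) are hypothetical. The sketch computes the relative frequencies of the most frequent words in the corpus and then groups the texts by agglomerative clustering.

```python
from collections import Counter
from itertools import combinations
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_profiles(texts, n_words=5):
    """Relative frequency, per text, of the n_words most frequent corpus words."""
    tokens = {name: tokenize(t) for name, t in texts.items()}
    corpus = Counter(w for toks in tokens.values() for w in toks)
    vocab = [w for w, _ in corpus.most_common(n_words)]
    profiles = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        profiles[name] = [counts[w] / len(toks) for w in vocab]
    return vocab, profiles


def distance(a, b):
    """Euclidean distance between two frequency profiles (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster(profiles, k):
    """Average-linkage agglomerative clustering, stopping at k clusters."""
    clusters = [[name] for name in sorted(profiles)]
    while len(clusters) > k:
        def linkage(pair):
            i, j = pair
            ds = [distance(profiles[a], profiles[b])
                  for a in clusters[i] for b in clusters[j]]
            return sum(ds) / len(ds)

        # Merge the two clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Invented toy "texts": A1/A2 and B1/B2 stand in for two authors.
texts = {
    "A1": "the cat saw the dog and the bird by the tree",
    "A2": "the man took the road to the town near the sea",
    "B1": "and so it was and so it went and it was of no use",
    "B2": "and it was of old and of new and it was of all",
}

vocab, profiles = frequency_profiles(texts, n_words=5)
groups = cluster(profiles, k=2)
print(vocab)
print(groups)
```

On these deliberately contrasting toy texts the two invented authors separate cleanly. Real texts are far longer and far less distinct, which is where the inaccuracies reported above arise.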