© 2002 by Association for Literary & Linguistic Computing
Frequent Word Sequences and Statistical Stylistics
New York University, New York, NY, USA
This paper investigates the relative effectiveness and accuracy of multivariate analysis, specifically cluster analysis, of the frequencies of very frequent words and the frequencies of very frequent word sequences in distinguishing texts by different authors and grouping texts by a single author. Cluster analyses based on frequent words are fairly accurate for groups of texts by known authors, whether the texts are long sections of modern British and US novels or shorter sections of contemporary literary critical texts, but they are only rarely completely accurate. When frequent word sequences are used instead of frequent words or in addition to them, however, the accuracy of the analyses often improves, sometimes dramatically, especially when personal pronouns are eliminated. Analyses based on frequent sequences even provide completely correct results in some cases where analyses based on frequent words fail. They also produce superior results for small groups of problematic novels and critical texts extracted from the larger corpora. Such successes suggest that analyses based on frequent word sequences constitute improved tools for authorship and stylistic studies.
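The general approach described in the abstract can be illustrated with a minimal sketch: build a profile of the most frequent word sequences for each text, then group texts whose profiles are close. The abstract does not specify the distance measure, the linkage method, the number of frequent items used, or exactly how personal pronouns are eliminated, so everything below is an assumption made for illustration: the `PRONOUNS` list, the function names `profiles` and `cluster`, Euclidean distance, average linkage, and the choice to drop pronouns before forming sequences. The toy texts are invented and are not from the study's corpora.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical pronoun list; the abstract only says personal pronouns are eliminated.
PRONOUNS = {"i", "me", "my", "mine", "we", "us", "our", "ours", "you", "your",
            "yours", "he", "him", "his", "she", "her", "hers", "it", "its",
            "they", "them", "their", "theirs"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def profiles(texts, n=2, top=50, drop_pronouns=True):
    """Relative frequencies, per text, of the `top` most frequent n-grams
    in the whole corpus. With n=1 this reduces to frequent single words."""
    counts = []
    for text in texts:
        tokens = tokenize(text)
        if drop_pronouns:
            # Assumption: pronouns are removed before sequences are formed.
            tokens = [t for t in tokens if t not in PRONOUNS]
        counts.append(Counter(ngrams(tokens, n)))
    total = Counter()
    for c in counts:
        total.update(c)
    features = [gram for gram, _ in total.most_common(top)]
    rows = []
    for c in counts:
        size = sum(c.values()) or 1  # avoid division by zero for empty texts
        rows.append([c[gram] / size for gram in features])
    return features, rows

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster(rows, k):
    """Average-linkage agglomerative clustering down to k clusters.
    Returns sorted lists of text indices."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(pair):
        a, b = pair
        dists = [euclidean(rows[i], rows[j]) for i in a for j in b]
        return sum(dists) / len(dists)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)

# Invented toy texts: the first two share one set of habitual phrases,
# the last two share another.
texts = [
    "the cat sat on the mat and the cat sat on the rug",
    "the cat sat on the bed and the cat sat on the mat",
    "of course it is true that of course it is so",
    "of course it is so and of course it is true",
]
features, rows = profiles(texts, n=2, top=10)
groups = cluster(rows, 2)
print(groups)  # [[0, 1], [2, 3]]
```

Calling `profiles` with `n=1` gives the frequent-word baseline, so the same two functions can be used to compare word-based and sequence-based groupings on a set of texts by known authors.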