
Literary and Linguistic Computing 2004 19(4):509-524; doi:10.1093/llc/19.4.509
© 2004 by Association for Literary & Linguistic Computing

Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development*

Paul Baker¹, Andrew Hardie¹, Tony McEnery¹§, Richard Xiao¹, Kalina Bontcheva², Hamish Cunningham², Robert Gaizauskas², Oana Hamza², Diana Maynard², Valentin Tablan², Cristian Ursu², B. D. Jayaram³ and Mark Leisher⁴

¹ Lancaster University, UK; ² Sheffield University, UK; ³ Central Institute of Indian Languages, Mysore, India; ⁴ New Mexico State University Computing Labs, USA

This paper describes the work carried out on the EMILLE Project (Enabling Minority Language Engineering), which was undertaken by the Universities of Lancaster and Sheffield. The primary resource developed by the project is the EMILLE Corpus, which consists of a series of monolingual corpora for fourteen South Asian languages, totalling more than 96 million words, and a parallel corpus of English and five of these languages. The EMILLE Corpus also includes an annotated component, namely, part-of-speech tagged Urdu data, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use in Hindi. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools for EMILLE has contributed to the ongoing development of the LE architecture GATE, which has been extended to make use of Unicode. GATE thus plugs some of the gaps for language processing R&D necessary for the exploitation of the EMILLE corpora.


* For earlier reports of this work see Baker et al. (2002) and Tablan et al. (2002).

§ Correspondence: Tony McEnery, Department of Linguistics and Modern English Language, Lancaster University, Lancaster, LA1 4YT, UK. E-mail: a.mcenery@lancaster.ac.uk





