IT applications in the humanities: Estonian Interlanguage Corpus

Pille Eslon
Mart Laanpere

The Estonian Interlanguage Corpus, established at Tallinn University, is a collection of written texts in Estonian as a second language and as a foreign language. It comprises a number of sub-corpora, a user interface, a multi-level annotation and tagging system, a statistics module, an option for automatic parsing of texts, etc. By combining different characteristics of a text (e.g. genre, number of words or sentences), types of errors, and metadata on the language learner (e.g. first language, country of origin, gender, education, level of language proficiency), the user interface of the Estonian Interlanguage Corpus supports multi-level queries.
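To make the idea of a multi-level query concrete, the following is a minimal sketch in Python. It is not the corpus's actual interface or data model: the `Text` record, its field names, the `query` function, and the four sample records are all hypothetical and serve only to show how text characteristics, error types, and learner metadata can be combined in one filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """Hypothetical record for one corpus text (not the real schema)."""
    genre: str
    n_words: int
    first_language: str
    proficiency: str
    error_types: frozenset = frozenset()


def query(texts, *, genre=None, min_words=None, first_language=None,
          proficiency=None, error_type=None):
    """Return the texts that satisfy every criterion that was given.

    Criteria left as None are ignored, so the levels (text, error,
    learner) can be combined freely.
    """
    hits = []
    for t in texts:
        if genre is not None and t.genre != genre:
            continue
        if min_words is not None and t.n_words < min_words:
            continue
        if first_language is not None and t.first_language != first_language:
            continue
        if proficiency is not None and t.proficiency != proficiency:
            continue
        if error_type is not None and error_type not in t.error_types:
            continue
        hits.append(t)
    return hits


# Invented sample records, for illustration only.
texts = [
    Text("essay", 310, "Russian", "B2", frozenset({"case", "word_order"})),
    Text("letter", 120, "Russian", "B1", frozenset({"case"})),
    Text("essay", 280, "Finnish", "B2", frozenset({"spelling"})),
    Text("essay", 450, "Russian", "B2", frozenset({"spelling"})),
]

# Essays by learners whose first language is Russian that contain a case error.
hits = query(texts, genre="essay", first_language="Russian", error_type="case")
print(len(hits))
```

Each keyword argument corresponds to one level of the query, and omitting an argument simply drops that constraint.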

As of October 2013, the corpus contains 11,720 texts with a total of 3,185,591 running words; the average text length is 272 running words.
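The average length follows from dividing the total number of running words by the number of texts. A short check in Python, using only the two totals stated above (the variable names are illustrative):

```python
# Totals as reported for October 2013.
total_texts = 11_720
total_words = 3_185_591

# Average text length in running words, rounded to the nearest integer.
average_length = total_words / total_texts
print(round(average_length))  # 272
```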

Table: Sub-corpora of the Estonian Interlanguage Corpus

Sub-corpus                           No. of texts  No. of running words  Average text length (words)
K2 main corpus                              3,151               804,094                          255
K2 national examinations                    7,856             1,989,844                          253
K2 open contests and olympiads                 63                58,684                          932
K2 academic writing in Estonian                13                14,716                        1,132
K1 academic writing in Estonian *               4                 3,339                          835
K1 Russian (reference corpus)                 370               209,885                          567
K3 Russian (reference corpus)                 273               101,566                          372

* This sub-corpus was compiled at the Centre of Academic Language, Tallinn University (P. Nemvalts).

The Estonian Interlanguage Corpus can be used in (1) empirical and applied research (e.g. acquisition of Estonian as L2, the language proficiency levels of the Council of Europe, usage patterns of the Estonian language, language development tendencies); (2) the training of future language teachers and linguists (e.g. error analysis, frequency of words and forms, cluster analysis); (3) the further training of practising language teachers (e.g. using corpora in language teaching, using corpus data to assess the validity of textbooks), etc.