List of text corpora

The following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large, structured set of texts, nowadays usually stored and processed electronically. Text corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language territory. For a more comprehensive list of text corpora, see https://linguistlist.org/sp/GetWRListings.cfm?wrtypeid=1

English language

American National Corpus
Bank of English
British National Corpus
Bergen Corpus of London Teenage Language (COLT)
Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB
Corpus of Contemporary American English (COCA), 425 million words, 1990–2011. Freely searchable online.
Corpus Resource Database (CoRD), more than 80 English language corpora.[1]
GUM corpus, the open source Georgetown University Multilayer corpus, with many annotation layers
Google Books Ngram Corpus[2][3]
International Corpus of English
Oxford English Corpus
RE3D (Relationship and Entity Extraction Evaluation Dataset)
Santa Barbara Corpus of Spoken American English
Scottish Corpus of Texts & Speech

European languages

CETENFolha
The Corpus of Electronic Texts
Corpus Inscriptionum Insularum Celticarum (CIIC), covering Primitive Irish inscriptions in Ogham
Google Books Ngram Corpus
The Georgian Language Corpus
Thesaurus Linguae Graecae (Ancient Greek)
Eastern Armenian National Corpus (EANC), 110 million words. Freely searchable online.
Spanish text corpus by Molino de Ideas, which contains 660 million words.[4]
CorALit: the Corpus of Academic Lithuanian. Academic texts published in 1999–2009 (approx. 9 million words), compiled at the University of Vilnius, Lithuania[5]
Reference Corpus of Contemporary Portuguese (CRPC)
Turkish National Corpus[6]
CoRoLa - The Reference Corpus of the Contemporary Romanian Language (Corpus reprezentativ al limbii române contemporane)
TS Corpus - a large set of Turkish corpora; a free and independent project that aims to build Turkish corpora, NLP tools and linguistic datasets
MacMorpho - an annotated corpus of Brazilian Portuguese text

East Slavic

South Slavic

West Slavic

German

German Reference Corpus (DeReKo), more than 4 billion words of contemporary written German.
Free corpus of German mistakes from people with dyslexia

Middle Eastern Languages

Corpus Inscriptionum Semiticarum
Kanaanäische und Aramäische Inschriften
Hamshahri Corpus (Persian)
Persian in MULTEXT-EAST corpus (Persian)[11]
Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
TEP: Tehran English-Persian Parallel Corpus[12]
TMC: Tehran Monolingual Corpus, Standard corpus for Persian Language Modeling[12]
Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 2005, 322 pp. ISBN 964-8699-32-1
Kurdish-corpus.uok.ac.ir (Kurdish-corpus Sorani dialect) University of Kurdistan, Department of English Language and Linguistics
Bijankhan Corpus, a contemporary Persian corpus for NLP research, University of Tehran, 2012
Neo-Assyrian Text Corpus Project
Quranic Arabic Corpus (Classical Arabic)
Electronic Text Corpus of Sumerian Literature
Open Richly Annotated Cuneiform Corpus
Asosoft text corpus[13]

Devanagari

Nepali Text Corpus (90+ million running words/6.5+ million sentences)

East Asian Languages

Kotonoha Japanese language corpus[14]
LIVAC Synchronous Corpus (Chinese)

South Asian Languages

SinMin dataset[15] (Sinhala)

Parallel corpora of diverse languages

Europarl Corpus - proceedings of the European Parliament from 1996–201

EUR-Lex corpus - collection of all official languages of the European Union, created from the EUR-Lex database[16]
OPUS: Open source Parallel Corpus in many languages[17]

Tatoeba - a parallel corpus containing over 8.9 million sentences in multiple languages; 107 languages have more than 1,000 sentences each, and a further 81 languages have between 100 and 1,000 sentences each.[18]

NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie)[19] (legacy repo)

SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.[20]

GRALIS parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)

The ACTRES Parallel Corpus (P-ACTRES 2.0) is a bidirectional English-Spanish corpus consisting of original texts in one language and their translation into the other. P-ACTRES 2.0 contains over 6 million words considering both directions together.[21]

The JRC-Acquis Multilingual Parallel Corpus of the total body of European Union (EU) law: Acquis Communautaire with 231 language pairs.[22]
European Parliament Proceedings Parallel Corpus 1996-2011
The Opus project aims at collecting freely available parallel corpora
Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles
COMPARA - Portuguese/English parallel corpora
TERMSEARCH - English/Russian/French parallel corpora (major international treaties, conventions, agreements, etc.)
TradooIT - English/French/Spanish - Free Online tools
Nunavut Hansard - English/Inuktitut parallel corpus
ParaSol - A parallel corpus of Slavic and other languages
Glosbe: Multilanguage parallel corpora with online search interface
InterCorp: A multilingual parallel corpus of 40 languages aligned with Czech, with online search interface
myCAT - Olanto, concordancer (open source AGPL) with online search on JRC and UNO corpus
TAUS, with online search interface.
linguatools multilingual parallel corpora, online search interface.
EUR-Lex Corpus - a corpus built from the EUR-Lex database, consisting of European Union law and other public documents of the European Union
Language Grid - Multilingual service platform that includes parallel text services

Comparable Corpora

WaCky - The Web-As-Corpus Kool Yinitiative (eng, fre, deu, ita)
Disambiguating Similar Language Corpora Collection (DSLCC)[23] (Bosnian, Croatian, Serbian, Indonesian, Malay, Czech, Slovak, Brazilian Portuguese, European Portuguese, Peninsular Spanish, Argentine Spanish)
Wikipedia Comparable Corpora (41 million aligned Wikipedia articles for 253 language pairs)
The TenTen Corpus Family – comparable web corpora with a target size of 10 billion words. These corpora are available in the corpus management system Sketch Engine. TenTen corpora currently exist for more than 30 languages (such as the English TenTen corpus,[24] Arabic TenTen corpus,[25] Spanish TenTen corpus,[26] and Russian TenTen corpus[27][28]). An overview of existing TenTen corpora can be found at https://www.sketchengine.co.uk/documentation/tenten-corpora/
Timestamped JSI web corpora – web corpora of news articles crawled from a list of RSS feeds. The newsfeed corpora are prepared within a project run by the Jožef Stefan Institute, a Slovenian scientific research institute,[29] and are published in Sketch Engine.

L2 (English) Corpora

Cambridge Learner Corpus[30]
Corpus of Academic Written and Spoken English (CAWSE),[31] a collection of Chinese students’ English language samples in academic settings. Freely downloadable online.
English as a Lingua Franca in Academic Settings (ELFA),[32] an academic ELF corpus.[33][34]
International Corpus of Learner English (ICLE),[35] a corpus of learner written English.
Louvain International Database of Spoken English Interlanguage (LINDSEI),[36] a corpus of learner spoken English.
Trinity Lancaster Corpus, one of the largest corpora of L2 spoken English.[37][38]
University of Pittsburgh English Language Institute Corpus (PELIC)[39]
Vienna-Oxford International Corpus of English (VOICE),[40] an ELF corpus.[33]

References


This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[1] "Corpus Resource Database (CoRD)". Department of English, University of Helsinki.

[2] Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.

[3] "PhraseFinder". A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.

[4] (in Spanish) "Molinolabs - corpus". molinolabs.com. Retrieved 12 January 2014.

[5] "CorALit – CorALit - Lietuvių mokslo kalbos tekstynas". coralit.lt. Retrieved 12 January 2014.

[6] "Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage". tnc.org.tr. Retrieved 12 January 2014.

[7] Glazkova, A (2018). "Automatic search for fragments containing biographical information in a natural language text". Proceedings of the Institute for System Programming of RAS. 30 (6): 221–236. doi:10.15514/ISPRAS-2018-30(6)-12.

[8] Rubtsova, Yu (2015). "Constructing a corpus for sentiment classification training". Software & Systems. 1: 72–78. doi:10.15827/0236-235X.109.072-078.

[9] "Under Update". search.dcl.bas.bg. Retrieved 12 January 2014.

[10] "Portál | Český národní korpus".

[11] Zdravkova, Katrina; Tufiş, Dan; Simov, Kiril; Radziszewski, Adam; Qasemizadeh, Behrang; Priest-Dorman, Greg; Petkevič, Vladimír; Oravecz, Csaba; Krstev, Cvetana; Kotsyba, Natalia; Kaalep, Heiki-Jaan; Ide, Nancy; Garabík, Radovan; Dimitrova, Ludmila; Derzhanski, Ivan; Barbu, Ana-Maria; Erjavec, Tomaž (2010-05-14). "Available from CLARIN". http://nl.ijs.si/me/v4/.

[12] "University of Tehran NLP Lab". ece.ut.ac.ir. Archived from the original on 28 January 2014. Retrieved 12 January 2014.

[13] Hadi Veisi, Mohammad MohammadAmini, Hawre Hosseini; Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus, Digital Scholarship in the Humanities, fqy074, https://doi.org/10.1093/llc/fqy074

[14] "KOTONOHA「現代日本語書き言葉均衡コーパス」少納言". kotonoha.gr.jp. Retrieved 12 January 2014.

[15] D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. de Silva, and G. Dias. 2015. Implementing a Corpus for Sinhala Language. In Symposium on Language Technology for South Asia.

[16] "EUR-Lex Corpus". sketchengine.co.uk. Retrieved 27 October 2016.

[17] "OPUS - an open source parallel corpus". opus.lingfil.uu.se. Retrieved 12 January 2014.

[18] "Tatoeba - Number of sentences per language". tatoeba.org. Retrieved 23 November 2020.

[19] Liling Tan and Francis Bond (14 May 2012). "Building and Annotating the Linguistically Diverse NTU-MC (NTU — Multilingual Corpus)" (PDF). International Journal of Asian Language Processing. 22 (4): 161–174. Archived from the original (PDF) on 16 January 2014. Retrieved 12 January 2014.

[20] Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.

[21] H. Sanjurjo-González and M. Izquierdo. 2019. P-ACTRES 2.0: A parallel corpus for cross-linguistic research. In Parallel Corpora for Contrastive and Translation Studies: New resources and applications (pp. 215-231). John Benjamins Publishing.

[22] Ralf Steinberger; Bruno Pouliquen; Anna Widiger; Camelia Ignat; Tomaž Erjavec; Dan Tufiş; Dániel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24–26 May 2006.

[23] Liling Tan, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). 2014.

[24] Kilgarriff, Adam (2012). "Getting to Know Your Corpus". Text, Speech and Dialogue. Lecture Notes in Computer Science. 7499. pp. 3–15. CiteSeerX 10.1.1.452.8074. doi:10.1007/978-3-642-32790-2_1. ISBN 978-3-642-32789-6.

[25] Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTen-Ten: a new, vast corpus for Arabic. Proceedings of WACL.

[26] Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia-Social and Behavioral Sciences, 95, 12-19.

[27] Хохлова, М. В. (2016). Обзор больших русскоязычных корпусов текстов. In Материалы научной конференции "Интернет и современное общество" (pp. 74-77).

[28] Khokhlova, M. (2016). Comparison of High-Frequency Nouns from the Perspective of Large Corpora. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9.

[29] Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434)

[30] "Cambridge English Corpus", Wikipedia, 2019-09-27, retrieved 2020-01-07

[31] "CAWSE Corpus - The University of Nottingham Ningbo China - 宁波诺丁汉大学". nottingham.edu.cn. Retrieved 2020-01-07.

[32] "English as a Lingua Franca in Academic Settings". University of Helsinki. 2018-03-23. Retrieved 2020-01-07.

[33] "English as a lingua franca", Wikipedia, 2019-12-14, retrieved 2020-01-07

[34] Mauranen, A (2010). "English as an academic lingua franca: The ELFA project". English for Specific Purposes. 29 (3): 183–190. doi:10.1016/j.esp.2009.10.001.

[35] "ICLE". UCLouvain. Retrieved 2020-01-07.

[36] "LINDSEI". UCLouvain (in French). Retrieved 2020-01-07.

[37] "Trinity Lancaster Corpus | ESRC Centre for Corpus Approaches to Social Science (CASS)". Retrieved 2020-01-07.

[38] Gablasova, D (2019). "The Trinity Lancaster Corpus: Development, Description and Application". International Journal of Learner Corpus Research. 5 (2): 126–158. doi:10.1075/ijlcr.19001.gab.

[39] Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set]. http://doi.org/10.5281/zenodo.3991977

[40] "Project". univie.ac.at. Retrieved 2020-01-07.