National Corpus of Polish

The National Corpus of Polish (Polish: Narodowy Korpus Języka Polskiego, NKJP) is the largest and most important corpus of the Polish language. A linguistic corpus is a collection of texts in which one can find typical uses of a word or phrase, along with its meaning and grammatical function.

Description

The National Corpus of Polish is a shared initiative of four institutions: the Institute of Computer Science and the Institute of the Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź. It has been registered as a research and development project of the Ministry of Science and Higher Education.

The intended size of the whole National Corpus of Polish is over 1 billion words, of which a 300-million-word subcorpus has been carefully balanced, and a manually annotated 1-million-word subcorpus has been released under an open license. The corpus is accessible online at http://nkjp.pl/poliqarp/

The corpus contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts.[1]

Search Engines

PELCRA – searches 1,200 million words from three corpora: IPIPAN, PELCRA, and PWN. It is easy to use, and the results can be downloaded as spreadsheets. A special query syntax supports morphological and orthographic expansion, lets several search options be combined in one query, and allows flexible matching of phraseological compounds. PELCRA also offers visualization of how a word is distributed across registers and generates time series for words, phrases, and idioms.
POLIQARP – searches for specific words or phrases. It can also find sequences specified with regular expressions, for example all phrases in the corpus that consist of a noun and an adjective, or all grammatical forms of a selected word (especially useful for studies of the Polish language). These operations, both online and offline, run quickly: simple search queries take no more than a few seconds.
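
The two kinds of query mentioned above (all grammatical forms of a word, and noun plus adjective sequences) can be illustrated with a minimal Python sketch over a hand-annotated sentence. This is not Poliqarp itself and does not use its query syntax. The sample sentence, the `Token` structure, the function names, and the part-of-speech labels (`subst` for noun, `adj` for adjective, and so on) are assumptions made for illustration only.

```python
from typing import NamedTuple


class Token(NamedTuple):
    orth: str  # surface form as it appears in the text
    base: str  # lemma (dictionary form)
    pos: str   # part-of-speech label (hypothetical labels for this sketch)


# Tiny hand-made sample, invented for illustration; not taken from the corpus.
SENTENCE = [
    Token("Język", "język", "subst"),
    Token("polski", "polski", "adj"),
    Token("ma", "mieć", "fin"),
    Token("wiele", "wiele", "num"),
    Token("języków", "język", "subst"),
    Token("pokrewnych", "pokrewny", "adj"),
]


def forms_of(tokens, lemma):
    """Return every surface form whose lemma equals `lemma`."""
    return [t.orth for t in tokens if t.base == lemma]


def pos_sequences(tokens, pattern):
    """Return each run of adjacent tokens whose POS labels match `pattern`."""
    n = len(pattern)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(t.pos == p for t, p in zip(window, pattern)):
            hits.append(" ".join(t.orth for t in window))
    return hits


noun_adj = pos_sequences(SENTENCE, ["subst", "adj"])
jezyk_forms = forms_of(SENTENCE, "język")
print(noun_adj)
print(jezyk_forms)
```

A real corpus search engine works on indexed, morphologically annotated text rather than scanning a list, but the idea of matching on lemma and part of speech instead of the raw string is the same.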

History

The first corpus to emerge was developed by the Institute of the Polish Language of the Polish Academy of Sciences (it is not publicly available), followed by the corpus of PWN publishers, then the corpus of the PELCRA group at the University of Łódź, and finally the corpus of the Institute of Computer Science of the Polish Academy of Sciences. All four teams decided to join forces in 2006, forming the Consortium for the National Corpus of Polish.[2]

References

[1] http://nkjp.pl/index.php?page=0&lang=1

[2] http://nkjp.pl/settings/papers/NKJP_ACADEMIA2009_pl.pdf
