Text corpora: Ukraine's achievements and prospects for drawing on foreign experience


Author:	Fokin S.
Title:	Text corpora: Ukraine's achievements and prospects for drawing on foreign experience
Publisher:	Kyiv University
Year:	2018
Pages:	pp. 51-64
Document type:	Article
Host document:	Bulletin of Taras Shevchenko National University of Kyiv
Abstract:	Nine text corpora of the Ukrainian language are reviewed, and their characteristics and potential for use in research are compared. The most essential parameters of electronic corpora are found to be the annotation both of texts as a whole (genre and topic, areal, chronological, sociological) and of the graphic words within them (part-of-speech, semantic); annotation of discourse characteristics is currently lacking. The principles of searching are summarized: the ability to search for a word, lexeme, phrase, or sentence, as well as for generalized expression masks; the challenge of the near future, however, is semantic annotation and the creation of corpora of different discourses. One of the first corpora built on Ukrainian-language texts was the Corpus of the Ukrainian Language, developed by the Laboratory of Computational Linguistics of the Institute of Philology of Taras Shevchenko National University of Kyiv [Корпус текстів української мови] and openly accessible since 2010; the laboratory staff who took part in its development were N. P. Darchuk (project lead), V. M. Sorokin (programmer), O. B. Siruk, Ya. V. Khodakivska, N. H. Cheilytko, and M. O. Lanhenbakh. Although at least five corpora of the Ukrainian language have existed since 2010 or earlier, most of them remain unknown to researchers, and corpus-based studies in Ukrainian philology are still seen as exotic, exceptional cases. The present study offers an overview of nine Ukrainian corpora, of which the largest and most complete are the "Ukrainian Language Corpus" at the web portal mova.info and "GRAK" ("General Regionally Annotated Corpus of Ukrainian"). Two of them belong to corpora collections ("Leeds Corpora Collection" and "Corpora Collection of Leipzig University"); two corpora are built on electronic document archives, which suggests that nowadays any set of electronic text documents meeting a common criterion can be converted into a simple corpus.
Today's large corpora allow search queries by multiple criteria: subject, period, style, gender, sociolect, etc. Another useful feature of large general corpora is searching with regular expressions and "wildcard symbols", which make it possible to run a whole set of queries at once, all matching a given search mask. The query formats vary from one corpus to another, from asterisks to classical regular expressions and CQL queries; not all Ukrainian corpora yet support searching for phrases longer than two words. Most large corpora are POS-annotated, which is a great achievement of computational linguistics, but the current challenge is semantic annotation, which is now being developed for the corpus at mova.info. Among the parameters of a corpus, those that matter for research are, besides its size (many case studies do not require a corpus larger than a hundred thousand tokens), document annotation and annotation of text components. As more and more research is discourse-oriented, more discourse-oriented corpora can be expected to appear in the near future.
