Abstract
This article discusses the automatic linguistic enrichment of historical Dutch corpora through part-of-speech tagging and lemmatization. This kind of enrichment facilitates linguistic research where manual annotation is infeasible. We built a neural network-based model using the PIE framework and compared it with two statistical baselines, MBT and HunPos: MBT is a memory-based tagger frequently used for modern Dutch, while HunPos is an open-source trigram tagger. We experimented with two data sets, the Corpus Gysseling (13th-century texts) and the Corpus van Reenen/Mulder (14th-century texts), and performed an in-depth error analysis to identify the strengths and weaknesses of each approach in labeling historical data. In general, the neural model scores better than the two baselines, even with limited training data. Based on the error analysis, we propose several strategies for future research to improve the labeling of historical Dutch.
Original language | English |
---|---|
Pages (from-to) | 57-72 |
Number of pages | 16 |
Journal | Computational Linguistics in the Netherlands Journal |
Volume | 10 |
Status | Published - 13 Dec 2020 |
Event | COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS - Utrecht University, Utrecht, Netherlands; Duration: 30 Jan 2020 → …; Conference number: 30 |
Bibliographical note
Funding Information: Part of this research was done as an internship at the Instituut voor de Nederlandse Taal by Silke Creten, who is currently working as a PhD student at KU Leuven. Peter Dekker was supported by funding from the Flemish Government under the Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen programme. Supervision of the project was done by Vincent Vandeghinste, senior researcher at the Instituut voor de Nederlandse Taal and part-time researcher at Leuven.AI and the Centre for Computational Linguistics of KU Leuven. We want to thank the anonymous reviewers for their valuable suggestions that have helped greatly improve the present article.
Publisher Copyright:
© 2020 Silke Creten, Peter Dekker, Vincent Vandeghinste.
Copyright:
Copyright 2021 Elsevier B.V., All rights reserved.
Projects
- 1 Active
- VLAAI1: Grant: Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen
  1/07/19 → 31/12/23
  Project: Applied
Activities
- 1 Talk or presentation at a conference
- Linguistic enrichment of historical Dutch using deep learning
  Silke Creten (Speaker), Peter Dekker (Speaker) & Vincent Vandeghinste (Speaker)
  30 Jan 2020
  Activity: Talk or presentation at a conference