Reconstructing language ancestry by performing word prediction with neural networks

Student thesis: Master's Thesis


In recent years, computational methods have led to new discoveries in the field of historical linguistics. In my thesis, I applied the machine learning paradigm, successful in many computing tasks, to historical linguistics. I proposed the task of word prediction: by training a machine learning model on pairs of words in two languages, the model learns the sound correspondences between the languages and should be able to predict unseen words.
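As a toy illustration of this setup (hypothetical code, not the thesis models, and with a deliberately naive position-based alignment), the idea is to learn phoneme correspondences from cognate pairs and apply them to an unseen word:

```python
from collections import Counter

def learn_correspondences(pairs):
    """Count phoneme substitutions from cognate pairs.

    Toy assumption: words are equal-length and position-aligned;
    real models learn alignments and context-dependent rules.
    """
    counts = {}
    for src, tgt in pairs:
        for a, b in zip(src, tgt):
            counts.setdefault(a, Counter())[b] += 1
    # keep the most frequent target phoneme for each source phoneme
    return {a: c.most_common(1)[0][0] for a, c in counts.items()}

def predict(word, rules):
    """Predict the target-language form of an unseen source word."""
    return "".join(rules.get(p, p) for p in word)

# Invented toy data mimicking a d -> t correspondence
pairs = [("dag", "tag"), ("dak", "tak"), ("dun", "tun")]
rules = learn_correspondences(pairs)
print(predict("dam", rules))  # 'd' -> 't' generalizes; unknown 'm' passes through
```

A neural model replaces the counting step with a trained sequence model, but the evaluation is the same: predict word forms that were not seen during training.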
I used two neural network models, a recurrent neural network (RNN) encoder-decoder and a structured perceptron, to perform this task. I showed that, by performing word prediction, results for multiple tasks in historical linguistics can be obtained, such as phylogenetic tree reconstruction, identification of sound correspondences, and cognate detection.
On top of this, I showed that the task of word prediction can be extended to phylogenetic word prediction, in which information is shared between language pairs based on the assumed structure of the ancestry tree. This task could be used for protoform reconstruction and could in the future lead to direct reconstruction of the optimal tree at prediction time.
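One way to picture the weight sharing (a minimal sketch with invented names and a toy tree, not the thesis architecture) is to give each tree edge its own parameter vector and let a language pair use the parameters along the path between the two languages, so that pairs traversing the same edge share those parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4

# Toy tree: proto -> {dutch, german, english}; one parameter vector per edge
edge_params = {
    ("proto", "dutch"): rng.normal(size=dim),
    ("proto", "german"): rng.normal(size=dim),
    ("proto", "english"): rng.normal(size=dim),
}
parent = {"dutch": "proto", "german": "proto", "english": "proto"}

def path_to_root(lang):
    """Edges from a language up to the root of the tree."""
    edges = []
    while lang in parent:
        edges.append((parent[lang], lang))
        lang = parent[lang]
    return edges

def pair_params(src, tgt):
    """Parameters for the (src, tgt) prediction model: the sum of the
    edge parameters on the paths src -> root and root -> tgt."""
    return sum(edge_params[e] for e in path_to_root(src) + path_to_root(tgt))
```

Under this structure, the pairs (dutch, german) and (dutch, english) both reuse the ("proto", "dutch") parameters, and the parameters attached to edges above internal nodes correspond to unattested protoforms.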
By combining insights from two fields, machine learning and historical linguistics, this thesis provides some notable contributions. Firstly, to my knowledge, this is the first publication to use a deep neural network as a model of sound correspondences in historical linguistics. Secondly, in this thesis I propose a new cognacy prior loss, enabling a neural network to learn more from some training examples than from others. This new loss function has not yet given a clear performance increase in my experiments. I hope, however, that it can be a first step toward a method that learns more from cognate than from non-cognate training examples, a key issue when applying machine learning to historical linguistics and to other disciplines. Thirdly, I use embedding encoding, inspired by word embeddings in natural language processing, to encode phonemes in historical linguistics. In my experiments, this encoding seems to work better than existing one-hot and phonetic encodings. Furthermore, I developed a method to visualize the patterns learned by a neural network by comparing clusterings of network activations with clusterings of input and target words. Finally, I propose phylogenetic word prediction, which shares weights between language pairs along a phylogenetic tree and enables protoform reconstruction from a neural network.
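The contrast between one-hot and embedding encodings of phonemes can be sketched as follows (a hypothetical toy inventory and dimensionality; in the thesis setting the embedding matrix is trained jointly with the prediction task rather than fixed at random):

```python
import numpy as np

# Invented toy phoneme inventory
phonemes = ["p", "b", "t", "d", "a", "i"]
index = {p: i for i, p in enumerate(phonemes)}

def one_hot(p):
    """One-hot encoding: every phoneme is orthogonal to, and thus
    equally different from, every other phoneme."""
    v = np.zeros(len(phonemes))
    v[index[p]] = 1.0
    return v

# Embedding encoding: a dense matrix, here randomly initialized.
# When trained with the model, phonemes that behave alike in sound
# correspondences end up with similar vectors.
rng = np.random.default_rng(0)
embedding_dim = 3
E = rng.normal(size=(len(phonemes), embedding_dim))

def embed(p):
    """Look up the dense vector for a phoneme."""
    return E[index[p]]
```

The advantage over a hand-designed phonetic feature encoding is that similarity is learned from the data rather than stipulated in advance.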
With this thesis, I hope to contribute to future insights about the ancestry of languages. Advances have been made in recent years by applying computational methods in historical linguistics. In this thesis, I built further upon this development and proposed a central role for machine learning in historical linguistics. This is motivated both by a practical perspective (machine learning has shown successes in many other research areas) and by a fundamental perspective (the observed parallel between regular sound change and generalization in machine learning). I am looking forward to the new findings in historical linguistics that may follow from this new line of methods.
Date of Award: 25 Jan 2018
Original language: English
Awarding Institution
  • University of Amsterdam


Keywords
  • deep learning
  • historical linguistics
  • machine learning
  • language change
  • phylogenetics
  • phonetics
