Abstract
In some machine learning applications, obtaining data points is relatively easy, yet labeling them can be expensive or tedious. Such scenarios lead to datasets with few labeled points and a larger number of unlabeled ones. Semi-supervised classification techniques combine labeled and unlabeled data during the learning process in order to improve baseline supervised methods that use only labeled data. Unfortunately, most successful semi-supervised classifiers are complex structures that do not allow explaining their predictions, thus behaving like black boxes. However, there is an increasing number of problem domains in which experts demand a clear understanding of the decision process. Intrinsically interpretable classifiers (i.e., white-box models) are transparent structures that allow making predictions, obtaining an associated explanation, and inspecting the model as a whole. Nevertheless, these advantages generally come at the cost of performance in terms of accuracy.

In this thesis, we propose the self-labeling grey-box model, a semi-supervised classifier that aims to provide a suitable balance between accuracy and interpretability. The self-labeling grey-box uses an accurate black-box classifier for labeling the unlabeled data and a white-box surrogate classifier for building an interpretable model. Since the self-labeling process can propagate errors, we propose two amending procedures based on class membership probabilities and certainty measures from rough set theory. The experimental study shows the influence of increasing ratios of labeled and unlabeled data across benchmark datasets. Moreover, we study the effect of different black-box and white-box base classifiers, as well as the two proposed amending procedures, in terms of both accuracy and interpretability. The results support the interpretability of our classifier, using simplicity and transparency as proxies, while attaining superior prediction rates compared with state-of-the-art self-labeling classifiers.

Additionally, we illustrate the applicability of the self-labeling grey-box classifier with preliminary results in two case studies from the field of bioinformatics. The first task concerns the detection of disease-causing genomic variants in a rare disease, while the second application tackles the prediction of early folding in proteins. Both case studies require an interpretable model able to leverage extra unlabeled data.
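The core procedure can be sketched in a few lines: train the black-box on the labeled data, let it label the unlabeled instances, filter the self-labeled points, and fit the white-box surrogate on the enlarged training set. The minimal Python sketch below is illustrative only; it assumes scikit-learn estimators, a random forest as the black-box, a shallow decision tree as the white-box surrogate, and a plain probability-threshold filter standing in for the amending procedures proposed in the thesis (which rely on class membership probabilities and rough-set certainty measures).

```python
# Minimal sketch of a self-labeling grey-box classifier (illustrative, not the
# thesis implementation). Assumptions: scikit-learn estimators, a random forest
# as the black-box, a shallow decision tree as the white-box surrogate, and a
# simple confidence-threshold filter in place of the proposed amending steps.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def self_labeling_grey_box(X_labeled, y_labeled, X_unlabeled, threshold=0.8):
    # 1. Fit the accurate black-box on the labeled data only.
    black_box = RandomForestClassifier(n_estimators=200, random_state=0)
    black_box.fit(X_labeled, y_labeled)

    # 2. Self-labeling: predict labels and confidences for the unlabeled data.
    proba = black_box.predict_proba(X_unlabeled)
    pseudo_labels = black_box.classes_[np.argmax(proba, axis=1)]
    confidence = np.max(proba, axis=1)

    # 3. Amending (simplified): keep only confidently self-labeled instances
    #    to limit the propagation of labeling errors.
    keep = confidence >= threshold
    X_train = np.vstack([X_labeled, X_unlabeled[keep]])
    y_train = np.concatenate([y_labeled, pseudo_labels[keep]])

    # 4. Fit the interpretable white-box surrogate on the enlarged training set.
    white_box = DecisionTreeClassifier(max_depth=4, random_state=0)
    white_box.fit(X_train, y_train)
    return white_box
```

The returned decision tree can be inspected or visualized directly, which is the interpretability gain that the grey-box construction targets.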
| Date of Award | 9 Oct 2020 |
| --- | --- |
| Original language | English |
| Awarding Institution | |
| Supervisor | Ann Nowe (Promotor), María Matilde García Lorenzo (Co-promotor), Beat Signer (Jury), Wim Vranken (Jury), Tom Lenaerts (Jury), Sonia Van Dooren (Jury), Koen Vanhoof (Jury) & Chris Cornelis (Jury) |