Contributions to Clustering and Feature Selection Methods for Clustering

    Scriptie/masterproef: Doctoral Thesis

    Uittreksel

    In pharmaceutical research, many new analytical chemical techniques, certainly hyphenated techniques, such as high-performance liquid chromatography (HPLC) coupled to diode array detection (HPLC-DAD), or gas chromatography coupled to mass spectrometry (GC-MS), allow obtaining several hundred or thousands of measurements (variables or features) per sample in a short time. Techniques such as combinatorial chemistry and high throughput screening allow synthesizing and measuring many thousands of samples a day. The large number of objects (samples) and the high-dimensionality of such data sets complicate the visualization and interpretation of the data.
    Clustering methods where investigated as a way to handle such large numbers of objects. The Neural Gas algorithm, originally developed for vector quantization, was introduced in the chemometrics and analytical chemistry field as an interesting clustering method. Comparisons with more frequently applied methods such as Kohonen's Self-Organising maps and K-means, and demonstration of some applications, showed its advantageous speed and statistical properties. Novel combinations with visualization techniques, such as hierarchical clustering and projection methods, to improve its exploratory capabilities, were proposed and demonstrated. The algorithm was also successfully introduced in the field of environmetrics (clustering and interpretation of environmental data).
    Whereas clustering methods help to handle large numbers of objects, feature selection methods help to handle large numbers of features. First of all, the identification of the most interesting features is useful for data interpretation and knowledge discovery. The removal of irrelevant or redundant features can help to save measurement costs and time, to save storage costs, to reduce the computation time and to improve the performance of learning algorithms (such as clustering.)
    Many feature selection methods are available, but most of them work only for supervised data where class information (labels) is available. For many data sets, such class information is not available (unsupervised data), and only unsupervised feature selection methods can be used. As unsupervised feature selection has been investigated only recently, and as there is at the moment no universally accepted unsupervised feature selection method, we investigated and proposed a few such methods in this thesis.

    In part II, it is discussed how feature selection can be useful for data interpretation and knowledge discovery, how the removal of redundant or irrelevant features can save measurement time and costs, storage costs and how it can improve the performance of learning algorithms such as clustering. While many methods exist for supervised feature selection, no universally accepted unsupervised feature selection technique exists. Therefore, new unsupervised approaches were investigated.

    An unsupervised feature selection approach based on rough set theory (RST) [5, 6] is proposed in chapter 5. The application is demonstrated on a taxonomic study of Pseudomonas species.

    In chapter 6 another unsupervised feature selection approach based on genetic algorithms [7] is proposed. The method is demonstrated on the Pseudomonas dataset and a Maillard gas chromatogram data set.

    In chapter 7 is demonstrated how CART (Classification and Regression Trees) [8, 9] and its recent extension (for more than one response variable) MRT (Multivariate Regression Trees) [10] can be used for supervised feature selection. A novel unsupervised feature selection approach based on this MRT method, AAMRT (Auto-Associative Multivariate Regression Trees), will be proposed. The method is demonstrated on a synthetic data set, the Pseudomonas data set, a Maillard flavour data set, and a viruses data set.


    In this thesis it is demonstrated how clustering and feature selection can be used to help with the exploration, visualization and interpretation of high-dimensional data sets in unsupervised settings.

    The introduction to clustering methods (part I, chapter 2) shows that a very wide range of methods is available, but that the quest for better methods is still ongoing.

    In part II, it is discussed how feature selection can be useful for data interpretation and knowledge discovery, how the removal of redundant or irrelevant features can save measurement time and costs, storage costs and how it can improve the performance of learning algorithms such as clustering. While many methods exist for supervised feature selection, no universally accepted unsupervised feature selection technique exists. Therefore, new unsupervised approaches were investigated.

    The Rough Set Theory [4,5] was already known to be useful for supervised feature selection. We elaborated an approach for unsupervised feature selection. In a wrapper style approach, the results of (hierarchical) clustering based on the full feature set are compared with those based on reduced feature sets (chosen as rough set reducts). The reduct resulting in the most similar clustering as the complete feature set gives us the best feature subset.
    The rough set based unsupervised feature selection method proved adequate for hierarchical clustering, in casu the taxonomic study of Pseudomonas flora, where the reduced feature subset resulted in a clustering similar as the complete feature set, and resembling the known taxonomy of the species. But as the method is still rather computationally intensive and only applicable on discrete data or after discretization of non-discrete data, other methods were investigated.

    Another unsupervised feature selection method, also applicable on continuous data and based on genetic algorithms (GA) [6, 7], was proposed. The use of genetic algorithms for supervised feature selection is already known [8-10]. In our unsupervised approach, the GA fitness function is constructed to minimize the difference between the dissimilarity matrix of the complete feature set and the ones of the considered reduced feature sets. This means that it is tried to preserve the cluster structure and inner distances between the objects while reducing the number of features. The approach was shown to be adequate for hierarchical clustering (demonstrated on the Maillard gas chromatogram and the Pseudomonas taxonomic data sets) and to perform somewhat better than the Rough Set based feature selection method. As most clustering methods are based explicitly or implicitly on using dissimilarity matrices, the methodology should also be useful for other types of clustering, such as nearest neighbour methods or neural network based methods.

    We focused the rough set based and genetic algorithms based methods on preserving the hierarchical clustering structure. Hierarchical clustering is often preferred to non-hierarchical clustering when the 'real' number of clusters is not known or when one is interested in the relationships between the objects, such as in taxonomical studies. However, feature selection for hierarchical clustering is difficult, as the selected feature subset should, compared to the original feature set, not only preserve as good as possible the clustering in a number of groups, but also the complete tree. While our approaches focussed on these hierarchical clustering aspects, the methods are just as easy, if not easier, to apply to non-hierarchical clustering.

    AAMRT (Auto-Associative Multivariate Regression Trees) was proposed as a novel unsupervised feature selection method, based on decision trees. More precisely, AAMRT is based on MRT (Multivariate Regression Trees) [11], which in turn is an extension of CART (Classification and Regression Trees) [12, 13]. CART is a decision tree method well known for, amongst other applications, supervised feature selection. MRT is a CART extension to handle simultaneously several response variables. AAMRT can be used in unsupervised settings by using the original variables (x) not only as explanatory variables (x) but also as response variables (y=x). Since (AA)MRT is grouping the objects into groups with similar response variables, this means that the variables are found which are most responsible for the cluster structure in the data. Whereas the rough set based and genetic algorithms based methods focussed on preserving a clustering as similar as possible as that based on the complete feature set, the AAMRT method can be used to improve cluster detection, by revealing cluster structures masked by redundant or irrelevant features. Moreover, AAMRT is a relatively fast and simple method with easy (graphical) interpretation and the same good statistical properties as CART and MRT (nonparametric method, invariant to monotonic transformations of the explanatory variables, robust to noise and outliers, able to handle missing data, and with model selection by cross-validation). Besides the selected feature set, AAMRT also reveals the resulting groups, the interactions of the variables, and the alternative and surrogate variables. This additional information proved, compared to other feature selection methods, useful for improved interpretability. Not only can AAMRT be used for feature selection for other clustering or learning methods, it can also be used itself for clustering. CART, MRT and AAMRT were compared and demonstrated on several data sets (synthetic, viruses, bacteria and flavours)
    In the mean time, the AAMRT method was found useful for the selection of orthogonal chromatographic systems, giving results consistent with other methods [14].

    Besides the three above-mentioned approaches for selecting a feature subset, we found two-way hierarchical clustering combined with a heat map an interesting tool to assist in understanding and selecting features and to visualize the relationships between objects, features and clusters. This was demonstrated in a study of orthogonal chromatographic systems for the determination of drug impurities [15].


    Summarizing, it was shown that the Neural Gas algorithm can be used as an interesting clustering method, especially combined with visualization techniques, such as hierarchical clustering and projection methods.
    From the three proposed unsupervised feature selection methods, the AAMRT method is most interesting, as it not only can preserve cluster quality while reducing the number of features, but even can improve cluster detection. Also the AAMRT speed, statistical properties and interpretation tools are advantageous. Two-way hierarchical clustering is an interesting and simple tool to visualize small data sets (not much more than 100 objects and variables) in unsupervised feature selection studies.

    These methods were demonstrated on synthetic data sets and data sets related to the pharmaceutical sciences (taxonomic studies of bacteria and viruses, NIR spectra, gas chromatograms from Maillard reaction products, flavour-sensory data, and chromatographic data sets to select orthogonal systems for the determination of drug impurities). The proposed clustering and feature selection methods are generally useful for knowledge discovery and data reduction of data sets ever increasing in size and complexity as produced in new pharmaceutical fields (high throughput screening, high content screening, combinatorial chemistry, genomics, proteomics, ...) and other sciences. Let us consider an example of possible application in the field of genomics. Microarray technology allows simultaneously measuring of the expression level of thousands of genes across related samples (normal cells, diseased cells, cells after drug treatment, one subtype of cancer versus another, etc). The main challenge is to find the relevant genes, which discriminate between such different samples. Typically, supervised feature selection is used with labeled samples. However, it is recognized that the automatic discovery of samples' phenotypes or the discovery of, for instance, new cancer subtypes is important in gene expression data analysis [16]. As one is then interested in a simultaneous unsupervised feature selection (of the genes) and clustering (of the cell samples), it could be interesting to evaluate our AAMRT method in such analysis.


    Further future prospects

    After the publication of our Neural-Gas paper, the quest for more efficient clustering methods has of course continued. As mentioned in chapter 3, others have since worked on related improved methods such as the (adapted) Growing Neural-Gas method (GNG*) and Growing k-means [17]. For the largest data sets where these methods could be too computationally intensive, it is interesting to investigate single scan clustering methods, such as those based on DBSCAN, or (maybe at expense of quality) grid-based clustering methods.

    Further investigation of our unsupervised feature selection method, AAMRT, could certainly be useful. Questions to be asked could be: How does it perform for other, e.g. larger, data sets? Are the feature subsets suggested by cross-validation indeed the feature subsets of most optimal size for most general purposes? How does the clustering, which AAMRT produces, compare to that of other clustering methods?


    References

    [1] M. Martinetz, S. Berkovich and K. Schulten, "Neural-gas" network for vector quantization and its application to time series prediction, IEEE Trans. Neural Networks, 4 (1993) 558-569
    [2] T. Kohonen, Self-organizing maps, 2nd ed, Springer-Verlag, Berlin, 1997
    [3] J. MacQueen, Some methods for classification and analysis of multivariate observations, In L. M. Le Cam and J. Neyman, ed., Proc. of the Fifth Berkeley Symp. Math. Stat. Prob, I, Statistics, University of California Press, Berkeley and Los Angeles, CA (1967) 281-297
    [4] Z. Pawlak, Rough sets, Int. J. Computer and Information Sci., 11 (1982) 341-356
    [5] W. Ziarko (ed.), Rough Sets, Fuzzy Sets and Knowledge Discovery, Proc. of the International Workshop on Rough Sets and Knowledge Discovery, Springer-Verlag, New York, 1994
    [6] D.E. Goldberg, Genetic Algorithms in Search Optimization and Machine Learning, Addison-Wesley Publishing Company, Massachusetts, 1989
    [7] J. Devillers, ed., Genetic Algorithms in Molecular Modeling, Academic Press, London, 1996
    [8] R. Leardi, R. Boggia, and M. Terrile, Genetic Algorithms as a strategy for feature selection, J. Chemom., 6 (1992) 267-281
    [9] J. Yang and V. Honavar, Feature subset selection using a genetic algorithm, IEEE Intelligent Systems (Special Issue on Feature Transformation and Subset Selection), 13, 2, (1998), 44-49
    [10] W. Siedlecki and J. Sklansky, A Note on Genetic Algorithms for Large-Scale Feature Selection, Pattern Recognition Letters, 10 (1989) 335-347
    [11] G. De'ath, Multivariate regression trees: a new technique for modeling species-environment relationships, Ecology 83 (2002) 1105-1117
    [12] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and regression trees, Wadsworth, Monterey, 1984
    [13] G. De'ath, K. E. Fabricius, Classification and regression trees: a powerful yet simple technique for ecological data analysis, Ecology 81 (2000) 3178-3192
    [14] R. Put, E. Van Gyseghem, Y. Vander Heyden, Selection of orthogonal reversed-phase HPLC systems by univariate and auto-associative multivariate regression trees, accepted for publication in Journal of Chromatography A (2005)
    [15] E. Van Gyseghem, S. Van Hemelryck, M. Daszykowski, F. Questier, D.L. Massart and Y. Vander Heyden, Determining orthogonal chromatographic systems prior to the development of methods to characterise impurities in drugs, Journal of Chromatography A 988 (2003) 77-93
    [16] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science, 286 (1999) 531-537
    [17] M. Daszykowski, B. Walczak, D. L. Massart, On the optimal partitioning of data with K-means, Growing K-means, Neural Gas and Growing Neural Gas, J. Chem. Inf. Comput. Sci. 42 (2002) 1378-1389
    Datum Prijs13 mei 2005
    TaalEnglish
    Toekennende instantie
    • Vrije Universiteit Brussel
    BegeleiderDesire Massart (Promotor), Yvan Vander Heyden (Co-promotor), Beata Walczak (Co-promotor) & Bartholomeus Rombaut (Jury)

    Citeer dit

    '