Bioinformatics and Genome Analysis.
Analysis procedures are needed to extract useful information from the large amount of gene expression data that is becoming available. This work describes a set of analytical tools and their application to yeast cell cycle data. The components of our approach are (1) a similarity measure that reduces the number of false positives, (2) a new clustering algorithm designed specifically for grouping gene expression patterns, and (3) an interactive graphical cluster analysis tool that allows user feedback and validation. We use the clusters generated by our algorithm to summarize genome-wide expression and to initiate supervised clustering of genes into biologically meaningful groups. The advent of oligonucleotide arrays and cDNA microarrays Fodor et al.
Summary: In this paper we present a data mining system, which allows the application of different clustering and cluster validity algorithms for DNA microarray data. This tool may improve the quality of the data analysis results, and may support the prediction of the number of relevant clusters in the microarray datasets.
This systematic evaluation approach may significantly aid genome expression analyses for knowledge discovery applications. The developed software system may be effectively used for clustering and validating not only DNA microarray expression analysis applications but also other biomedical and physical data with no limitations.
Contact: Nadia. Bolshakova cs. The rapid growth of data collections in science and business applications, together with the need to analyse these data and extract useful knowledge from them, has led to a new generation of tools and techniques grouped under the term data mining. The recent advent of DNA microarray (or gene chip) technologies allows the expression of thousands of genes to be measured simultaneously under multiple experimental conditions (Eisen et al.).
This technology is having a significant impact on genomic and post-genomic studies (Schena et al.). For instance, the accurate classification of tumours is essential for the successful diagnosis and treatment of cancer.
One of the problems associated with cancer tumour classification is the identification of new classes using gene expression profiles. There are two key aspects to this problem: (1) estimation of the number of clusters in the dataset; and (2) classification of unknown tumour samples based on these clusters (Dudoit and Fridlyand). In this paper we address the first of these problems.
This paper presents a data mining framework to evaluate DNA microarray data clustering results. A principal step in the analysis of gene expression data is the detection of samples or gene groups with similar expression patterns. Several clustering algorithms have been applied to the analysis of gene expression data (Granzow et al.).
Solutions for systematically evaluating the quality of the clusters have also been presented (Bolshakova and Azuaje). The prediction of the correct number of clusters is a fundamental problem in unsupervised classification. Many clustering algorithms require the number of clusters to be defined beforehand. To overcome this problem, various cluster validity indices have been proposed to assess the quality of a clustering partition (Azuaje). This approach consists of running a clustering algorithm several times to obtain different partitions; the partition that optimises the validity index under consideration is selected as the best partition.
Thus, the main goal of a cluster validity technique is to identify the partition of clusters for which a measure of quality is optimal. The recognition of these requirements in the analysis of gene expression data led us to develop the Machaon CVE system.
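The selection procedure described above (obtain several partitions, score each with a validity index, keep the best) can be sketched in a few lines. This is an illustrative sketch only, not the Machaon CVE implementation: the function names `silhouette` and `best_partition` are hypothetical, the partitions are assumed to be given as lists of cluster labels keyed by the number of clusters, and the mean silhouette width with Euclidean distance is assumed as the validity index.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette(points, labels):
    """Mean silhouette width of a partition (higher means better separated)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # common convention: a singleton contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_partition(points, partitions, index=silhouette, maximise=True):
    """Score every candidate partition and return (best key, all scores)."""
    scores = {k: index(points, labels) for k, labels in partitions.items()}
    pick = max if maximise else min
    return pick(scores, key=scores.get), scores


# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k, scores = best_partition(points, candidates)
```

For an index where low values indicate strong clusters, the same helper can be called with `maximise=False`.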
The major functions of the system can be summarised as follows:
Clustering. It offers some of the well-established clustering methods that are available in the literature.
Evaluation of the clustering scheme (cluster validation). The clustering methods can find a partition in a dataset, based on certain assumptions. Thus, an algorithm may produce different clustering schemes for the same dataset under different parameter values. Machaon evaluates the results of clustering algorithms based on quality indices and selects the clustering scheme that best fits the data.
The definition of these indices is based on two fundamental criteria of clustering quality: cluster compactness and isolation. The software is implemented as a multi-window Java application, which allows working with different datasets, clustering and validation algorithms, and results simultaneously. The Machaon tool is a data mining system based on the framework described in the previous section.
The system provides the following services: (1) access to data, (2) implementation of clustering algorithms, and (3) evaluation of clustering results using cluster validity indices. The system supports several variants of the tabular data formats widely used by third-party clustering tools (Herrero et al.). The focus is on clustering quality assessment and the visualization of data mining results. Clustering: Multiple clustering techniques may be applied to a dataset and the results may be easily compared.
The user may select one of the available clustering algorithms in order to define a partitioning for the dataset. Depending on the clustering algorithm, the user defines the values of its input parameters. The results of a hierarchical clustering can also be displayed using dendrograms. Every clustering result may be selected and validated with a number of parameterised validation methods. Cluster Validity: When the validation task is selected, the system searches for the parameter values of a specific clustering algorithm that yield the clustering scheme that best fits the data.
The user selects the clustering algorithm and the input parameter on which the validation task will be based. The range of input parameter values is also defined. Several methods for measuring gene-to-gene (or sample-to-sample), intercluster and intracluster distances can be used in any combination. This is important for investigating the influence of different distance metrics on both clustering and validation. Both clustering and validation results are represented as a two-level tree at the bottom of the corresponding dataset window.
Clustering indices are also displayed in additional columns of a dataset table. Every such column is associated with a single partition.
Apart from the clustering and validation results, the system shows, if known, the natural classification structure of the data, which allows comparisons against clustering results and validation analyses across natural classes. The clustering and validation methods included in Machaon CVE have been applied to gene expression datasets from recently published microarray studies: the leukemia dataset of Golub et al.
The software is implemented as a multi-window Java application, which allows working with multiple datasets, algorithms and results simultaneously. The Main Window panel contains the menu and indicates the current working dataset (Figure 1). Multiple DataSet Windows provide views on open datasets, including the expression data table and the Result Tree, which displays a list of all clustering and validation results obtained for the corresponding dataset.
Each row of the table may contain the data for either a single sample or a single gene, accompanied by the cluster indices for each partitioning. Machaon CVE uses the textual tab-delimited data files described in Table 1.
The format makes it possible to save the clustering results within a dataset. The Number of rows and Number of columns fields give the numbers of rows and columns in the expression table.
Bold entries indicate required records. The program can read files that already contain the number of clusters, that is, datasets that have already been clustered by other software tools. Thus, the user can apply the validation techniques to data files produced by other systems.
An example of the described format is shown in Table 2. Several different types of clustering are implemented in the software. They include hierarchical clustering (single, complete, average, centroid, average to centroids and Hausdorff linkages) and non-hierarchical clustering such as the K-Means algorithm (Everitt). Three types of metrics (Euclidean, Manhattan and Chebyshev distances) can be used in the clustering algorithms. Optional Row Normalization can also be applied to the microarray dataset.
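As a rough illustration of the three metrics named above, the sketch below computes each of them for two expression profiles. The function names are hypothetical. The text does not say which form of row normalization Machaon CVE applies, so the z-score version shown here (zero mean, unit variance per row) is purely an assumption.

```python
def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def normalise_row(row):
    """ASSUMED normalization: shift to zero mean, scale to unit variance."""
    mean = sum(row) / len(row)
    std = (sum((v - mean) ** 2 for v in row) / len(row)) ** 0.5
    if std == 0:
        return [0.0 for _ in row]  # constant row: nothing to scale
    return [(v - mean) / std for v in row]


# Hypothetical expression profiles of two genes over two conditions.
g1, g2 = (0.0, 0.0), (3.0, 4.0)
distances = (euclidean(g1, g2), manhattan(g1, g2), chebyshev(g1, g2))
```

Because every metric takes two sequences and returns a number, any of them can be passed to a clustering or validation routine that accepts a distance function.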
To start the clustering calculation, the user selects a method from the Clustering submenu of the main menu. The Parameter Window then appears, in which the clustering parameters described above can be set.
For instance, hierarchical clustering may be applied to the leukemia dataset (Figure 1). As soon as the calculation is completed, a new entry is added to the Result Tree. The result of clustering is also shown in the expression table as a new column of cluster indices appended to the right of the table.
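To make the link between a hierarchical clustering run and the resulting column of cluster indices concrete, here is a minimal, naive agglomerative sketch. It is not the Machaon CVE code: the function name `agglomerate` is hypothetical, Euclidean distance is assumed, only three of the linkages mentioned in the text are covered, and the merging stops once the requested number of clusters is reached instead of building a full dendrogram.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, Python 3.8+


def agglomerate(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain.

    Returns one cluster index per input row, i.e. the kind of column
    that could be appended to an expression table.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")

    def link(a, b):
        ds = [dist(points[i], points[j]) for i, j in product(a, b)]
        if linkage == "single":
            return min(ds)
        if linkage == "complete":
            return max(ds)
        if linkage == "average":
            return sum(ds) / len(ds)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]

    labels = [0] * len(points)
    for lab, members in enumerate(clusters):
        for idx in members:
            labels[idx] = lab
    return labels


# Hypothetical toy data: two obvious groups of two samples each.
cluster_column = agglomerate([(0, 0), (0, 1), (10, 0), (10, 1)], 2)
```

The search for the closest pair is recomputed from scratch at every step, which keeps the sketch short but is far too slow for real microarray datasets.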
In the case of hierarchical clustering, a user may view the results as a dendrogram. There are six types of intercluster distances (single, complete, average, centroid and average to centroids distances, and the Hausdorff metric), three types of intracluster distances (complete, average and centroid diameters) and three types of metrics (Euclidean, Manhattan and Chebyshev distances) that can be used with every method in any combination.
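The way an intercluster distance, an intracluster diameter and a metric combine into a single validity index can be illustrated with Dunn's index, taken here in its usual form: the smallest intercluster distance divided by the largest intracluster diameter. The sketch below is an assumption-laden illustration, not the tool's implementation: all names are hypothetical, the definitions follow the common textbook forms, and the average to centroids distance is omitted.

```python
from itertools import combinations, product
from math import dist  # default metric: Euclidean, Python 3.8+


def centroid(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def inter_distance(s, t, kind, d):
    """Distance between clusters s and t under the chosen definition."""
    if kind == "single":
        return min(d(x, y) for x, y in product(s, t))
    if kind == "complete":
        return max(d(x, y) for x, y in product(s, t))
    if kind == "average":
        return sum(d(x, y) for x, y in product(s, t)) / (len(s) * len(t))
    if kind == "centroid":
        return d(centroid(s), centroid(t))
    if kind == "hausdorff":
        return max(max(min(d(x, y) for y in t) for x in s),
                   max(min(d(x, y) for x in s) for y in t))
    raise ValueError(f"unknown intercluster distance: {kind}")


def diameter(s, kind, d):
    """Size of cluster s under the chosen definition."""
    if kind == "complete":  # largest pairwise distance
        return max((d(x, y) for x, y in combinations(s, 2)), default=0.0)
    if kind == "average":  # mean pairwise distance
        pairs = len(s) * (len(s) - 1) / 2
        return sum(d(x, y) for x, y in combinations(s, 2)) / pairs if pairs else 0.0
    if kind == "centroid":  # twice the mean distance to the centroid
        c = centroid(s)
        return 2 * sum(d(x, c) for x in s) / len(s)
    raise ValueError(f"unknown intracluster diameter: {kind}")


def dunn(points, labels, inter="single", intra="complete", metric=dist):
    """Dunn's index: min intercluster distance / max intracluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    if len(groups) < 2:
        raise ValueError("Dunn's index needs at least two clusters")
    smallest_gap = min(inter_distance(s, t, inter, metric)
                       for s, t in combinations(groups, 2))
    largest_diameter = max(diameter(s, intra, metric) for s in groups)
    if largest_diameter == 0:
        return float("inf")  # every cluster is a single point
    return smallest_gap / largest_diameter


# Hypothetical toy data: two compact groups ten units apart.
value = dunn([(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])
```

Large values indicate compact, well-isolated clusters, so a partition would be chosen by maximising this index over the candidate numbers of clusters.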
For further information on these types of metrics the reader is referred to Bolshakova and Azuaje. To apply a validation technique, the user first selects the Cluster Set in the Result Tree and then chooses the validation method from the Validation submenu of the main menu.
Validation parameters may be adjusted using the Parameters Window, after which the selected method may be executed. The result of validation is attached to the clustering result node in the tree (Figure 2). By way of illustration, different validity indices are applied to the leukemia cluster sets to find the optimal partitioning. Here the C-index, Goodman-Kruskal, Silhouette, Dunn's and Davies-Bouldin (with parameters: complete intercluster distance and complete intracluster diameter) indices are applied to the partitionings of the leukemia dataset (number of clusters from 2 to 6) obtained by average linkage clustering.
The results of the validation are shown in Table 3 (low values of the C-index and the Davies-Bouldin index are indicative of strong clusters). A user may now browse through the Result Tree and compare the validity indices of different partitionings to determine, for example, the optimal clustering parameters. In this case, as can be seen in Table 3, the most appropriate partitioning for the leukemia dataset consists of two clusters, which is supported by all validation methods.
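To show why low values of the C-index point to strong clusters, the sketch below implements the index in its commonly cited form, C = (S - S_min) / (S_max - S_min). Here S is the sum of the distances over all within-cluster pairs, and S_min and S_max are the sums of the same number of smallest and largest pairwise distances in the whole dataset. The function name is hypothetical and Euclidean distance is assumed; the exact variant used by Machaon CVE is not specified in the text.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def c_index(points, labels):
    """C-index of a partition; values near 0 indicate strong clusters."""
    # Every unordered pair of points: (distance, same-cluster flag).
    pairs = [(dist(p, q), lp == lq)
             for (p, lp), (q, lq) in combinations(zip(points, labels), 2)]
    all_distances = sorted(d for d, _ in pairs)
    within = [d for d, same in pairs if same]
    n_within = len(within)
    if n_within == 0 or n_within == len(all_distances):
        raise ValueError("need both within- and between-cluster pairs")
    s = sum(within)
    s_min = sum(all_distances[:n_within])   # the n_within smallest distances
    s_max = sum(all_distances[-n_within:])  # the n_within largest distances
    if s_max == s_min:
        return 0.0  # all distances equal: the index is uninformative
    return (s - s_min) / (s_max - s_min)


# Hypothetical toy data: the first labelling matches the two groups,
# the second one splits each group across both clusters.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
matching = c_index(points, [0, 0, 1, 1])
mismatched = c_index(points, [0, 1, 0, 1])
```

When the within-cluster pairs are exactly the closest pairs in the dataset, S equals S_min and the index is 0; the further the within-cluster distances drift toward the largest ones, the closer the index gets to 1.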
For instance, Table 4 contains the results of validation by Dunn's index for partitionings of the same B-cell lymphoma dataset obtained by the hierarchical method with different numbers of clusters and different linkage calculation algorithms.
Hence, the researcher may conclude that the Hausdorff linkage used in the hierarchical method produces a noticeably different partitioning. This paper describes a software tool, Machaon CVE, that offers multiple clustering and cluster validity methods for DNA microarray data analysis.
Various commercial and non-commercial software packages and web applications implement different clustering methods, but they lack facilities for estimating the optimal number of clusters, as well as components for evaluating the quality of the clusters obtained. Machaon CVE allows the application of various validation methods to multiple datasets, which may be clustered by third-party tools.
Five validation and two clustering techniques, with various combinations of gene-to-gene (or sample-to-sample), intercluster and intracluster distances, have been implemented in this system. The tool described in this paper will contribute to the evaluation of clustering outcomes and the identification of optimal cluster partitions.
The estimation approach described represents an effective tool to support biomedical knowledge discovery in gene expression data analysis. Even though Machaon CVE was developed for DNA microarray expression analysis applications, it may be effectively used for clustering and validating other biomedical and physical data with no limitations.
Table 2. An example originating from leukemia data. The format described in Table 1 is applied to the data.
Table 3. Validity indices for expression clusters originating from leukemia data. Bold entries highlight the optimal number of clusters, n, predicted by each method.
Table 4. Dunn's validity indices for expression clusters originating from B-cell lymphoma data. Bold entries highlight the optimal number of clusters, n, predicted by each hierarchical clustering method.
The molecular mechanisms of cholangiocarcinoma (CC) oncogenesis and progression are poorly understood. This study aimed to determine the genome-wide expression of genes related to CC oncogenesis and sarcomatous transdifferentiation. Genes that were differentially expressed between CC cell lines or tissues and cultured normal biliary epithelial (NBE) cells were identified using DNA microarray technology. Expression was validated in human CC tissues and cells. The expression of 12 proteins was validated in the CC cell lines by immunoblot analysis and immunohistochemical staining. The deregulation of oncogenes, tumor suppressor genes, and methylation-related genes may be useful in identifying molecular targets for CC diagnosis and prognosis.
Classification, Clustering, and Data Analysis. The availability of microarray data has renewed interest in clustering and classification methods. DNA microarrays are likely to play an important role in diagnosis and prognosis in clinical practice. Using the example of gene expression in diffuse large B-cell lymphoma, I introduce and review proposals for the experimental design and pattern recognition problems of gene expression experiments, the supervised learning (classification) problem, the unsupervised learning (clustering) problem, and the potential for improving prognostic models.
The papers that are presented in detail during lectures are highlighted in red. The other papers can be used for presentations. The papers highlighted in white are the ones I recommend for presentation. Lectures 1 - 8 from last year's course.