Genomics and Bioinformatics Group

Microarray Data Analysis

Exploratory Analysis


Exploratory analysis aims to find patterns in the data that are not predicted by the experimenter's current knowledge or preconceptions. Typical goals are to identify groups of genes whose expression patterns across samples are closely related, or to find unknown subgroups among samples. A useful first step in any analysis is to identify outliers among samples: those that appear suspiciously far from the others in their group. To address these questions, researchers have turned to methods such as cluster analysis and principal components analysis, although these have often been used inappropriately.
The first widely publicized microarray studies aimed to find uncharacterized genes that act at specific points during the cell cycle, and clustering is the natural first step in doing so. Unfortunately many people got the impression that clustering is the 'right' thing to do with microarray data, and the confusion has been perpetuated because many software packages cater to this impression. The proper way to analyze data is the way that addresses the goal at which the study was aimed. Clustering is a useful exploratory technique for suggesting resemblances among groups of genes, but it is not a way of identifying the differentially regulated genes in an experimental study.


After that disclaimer, suppose we do want to find groups of similar genes or similar samples: how do we go about it? Clustering depends on the idea that differences between gene expression profiles are like distances; however, the user must make several (somewhat arbitrary) choices to compute a single measure of distance from many individual differences. Different procedures emphasize different types of similarity and give different resulting clusters. Four choices you have to make are:
  • what scale to use: original scale, log scale, or another transform,
  • whether to use all genes or to make a selection of genes,
  • what metric (distance measure) to use to combine the scaled values of the selected genes, and
  • what clustering algorithm to use.
It is not clear which scale is most appropriate for multivariate exploratory techniques such as clustering and PCA (see below). Differences measured on the linear scale will be strongly influenced by the one hundred or so highly expressed genes, and only moderately affected by the hundreds of moderate-abundance genes; the thousands of low-abundance genes will contribute little. Often the high-abundance genes are 'housekeeping' genes; these may or may not be diagnostic for the kinds of differences being sought. On the other hand, the log scale will amplify the noise among genes with low expression levels; if low-abundance genes are included (see below), they should be down-weighted. In the author's opinion, the most useful measure of a single gene difference is the difference between two samples relative to that gene's variability within experimental groups: this is like a t-score for the difference between two individuals.
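This t-score-like measure can be sketched in a few lines of Python; the function name, the pooled-variance choice, and the toy numbers below are illustrative assumptions, not a prescription from the text.

```python
import numpy as np

def t_like_distance(x_a, x_b, group_values):
    """Difference between two samples for one gene, scaled by that gene's
    pooled within-group standard deviation (a t-score-like quantity).
    x_a, x_b: the gene's expression in the two samples being compared.
    group_values: list of 1-D arrays of replicate values, one per group."""
    # pooled within-group variance across all experimental groups
    ss = sum(((g - g.mean()) ** 2).sum() for g in group_values)
    df = sum(len(g) - 1 for g in group_values)
    pooled_sd = np.sqrt(ss / df)
    return abs(x_a - x_b) / pooled_sd

# toy replicate groups for one gene (illustrative values)
groups = [np.array([5.0, 5.2, 4.8]), np.array([7.0, 6.9, 7.1])]
d = t_like_distance(5.0, 7.0, groups)   # large: the 2-unit difference
                                        # dwarfs the within-group spread
```

A gene with the same 2-unit difference but noisy replicates would score much lower, which is exactly the down-weighting of unreliable genes argued for above.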
Gene Selection
It would be wise not to place much emphasis on genes whose values are uncertain: usually those with low signal relative to noise, or those that fail spot-level quality control. If the estimation software provides a measure of confidence in each gene estimate, this can be used to weight that gene's contribution to the overall distance. It is not wise simply to omit (that is, set to 0) distances that are not known accurately, but it is wise to down-weight relative distances if several are probably in error. A simple general rule is that genes whose signal falls within the background noise range are probably contributing only noise to your clustering (and to any other global procedure); discard them.
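As a sketch of this rule in Python: the cutoff of two background standard deviations below is an assumed, illustrative threshold, not a recommendation from the text.

```python
import numpy as np

def filter_low_signal(expr, background_sd, k=2.0):
    """Keep genes whose maximum signal across samples rises clearly above
    background noise. expr: genes x samples matrix; k: assumed multiplier
    on the background standard deviation (illustrative choice)."""
    keep = expr.max(axis=1) > k * background_sd
    return expr[keep], keep

# toy data: one gene well above background, one within the noise range
expr = np.array([[10.0, 12.0, 11.0],
                 [0.5, 0.8, 0.6]])
filtered, mask = filter_low_signal(expr, background_sd=1.0)
```

In practice one would use the platform's own detection calls (such as the MAS 5 'Present' calls mentioned in the figure caption below) rather than a hand-set threshold.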
Most cluster programs give you a menu of distance measures: geometric measures such as Euclidean and Manhattan distance, and relational measures such as correlation and, sometimes, relative distance and mutual information. The names describe how differences are combined: Euclidean distance is straight-line distance (the root of the sum of squares, as in geometry), while Manhattan distance is the sum of the individual linear distances (like navigating in Manhattan). The correlation distance measure is actually 1 − r, where r is the correlation coefficient; probably a more useful version is 1 − |r|, since negative correlation is as informative as positive correlation. The mutual information (MI) is defined in terms of entropy: H = −Σ p(x) log2 p(x) for a discrete distribution {p}. Then MI(g1,g2) = H(g1) + H(g2) − H(g1,g2) for genes g1 and g2. This measure is robust, in that it is not affected by outliers; however, it is tedious to program, because it requires adaptive binning to construct a meaningful discrete distribution.
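The simpler measures above can be sketched directly in Python. One assumption: the figure caption below defines 'relative' per gene as |x−y|/|x+y| but does not say how the per-gene terms are combined, so summing them here is a guess.

```python
import numpy as np

def euclidean(x, y):
    # straight-line distance: root of sum of squares
    return np.sqrt(((x - y) ** 2).sum())

def manhattan(x, y):
    # sum of the individual linear distances
    return np.abs(x - y).sum()

def correlation_dist(x, y, absolute=True):
    # 1 - r, or the arguably more useful 1 - |r|
    r = np.corrcoef(x, y)[0, 1]
    return 1 - abs(r) if absolute else 1 - r

def relative_dist(x, y):
    # per-gene |x-y|/|x+y|, summed here (combination rule is assumed)
    return (np.abs(x - y) / np.abs(x + y)).sum()

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
# y is a perfect multiple of x, so the correlation distance is ~0 even
# though the Euclidean and Manhattan distances are substantial
```

This tiny example shows why the choice matters: profiles with the same shape but different magnitude are 'identical' by correlation yet far apart by Euclidean or Manhattan distance.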
By and large there are no theoretical reasons to pick one measure over another, since we don't really know what we mean by 'distance' between expression profiles. The author prefers the Manhattan metric for clustering samples by similarity of gene expression levels, and a correlation measure for clustering genes. Most of these measures are fairly sensitive to outliers, mutual information being the exception; robust versions can easily be constructed by a competent statistician, but are not available in most software packages. However, we do get different results depending on the measure we use, as shown below for a study with 10 samples: two normal samples and two groups of tumor samples.
Study with two tumor samples and two non-tumor samples
Clustering of the same data set using four different distance measures. All genes were on a logarithmic scale, and only genes with an MAS 5 'Present' call in 8 out of 10 samples were used (Affymetrix data). The four measures are listed in the titles; 'relative' is |x-y|/|x+y|.

Clustering Algorithms

Most biologists find hierarchical clustering more familiar, and other algorithms somewhat magical. Statisticians object to hierarchical clustering because it seems (falsely) to imply descent; however, this is a quibble: all of the common clustering methods are based on models that don't really apply to microarray data. Broadly speaking, the differences between clustering methods show up in how ambiguous cases are assigned. If you have very many ambiguous cases you will see great differences between methods; if so, then perhaps clustering isn't appropriate anyway, because the data don't separate naturally into groups. The k-means and SOM methods require the user to specify in advance the number of clusters to be constructed. Of course you don't know this ahead of time, so most people end up trying many values. One criterion for deciding how many clusters to use is to track how much the intra-group variance drops with each additional cluster, and then to go back to the point where the rate of decrease slows down markedly. More advanced methods allow clusters to overlap, as often happens in biology, or omit some outlying genes.
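The variance-drop criterion just described can be sketched with a small hand-rolled k-means (Lloyd's algorithm with restarts); the function names and toy data are illustrative, not from any particular package.

```python
import numpy as np

def kmeans(X, k, n_iter=100, n_init=5, seed=0):
    """Minimal Lloyd's-algorithm k-means with random restarts.
    Returns the best labels and the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # assign each point to its nearest center
            labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
            # move each center to the mean of its points
            new = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels, best_inertia

# two well-separated synthetic groups: the intra-group variance drops
# sharply from k=1 to k=2, then barely changes -- the 'elbow' is at k=2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
inertias = [kmeans(X, k)[1] for k in (1, 2, 3)]
```

Plotting `inertias` against k and looking for the bend is the usual way this criterion is applied in practice.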
Statistical significance of clusters by bootstrapping
An important but rarely asked question is whether the clusters obtained from a procedure depend on the overall pattern of gene expression or on just a few samples; the clusters could be very different if one or two samples were omitted. One approach to this question is the bootstrap: you re-cluster many times, each time re-sampling conditions or genes from your original data, and derive new clusters of genes or conditions; a variant is known as jack-knife analysis. Branches in a hierarchical cluster that are supported by a large fraction of the re-sampled data sets are considered fairly reliable. A reasonable threshold is 70%, but this is arbitrary, like the often-quoted 5% level of significance.
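A toy sketch of the bootstrap idea in Python: re-sample genes with replacement and ask how often the same pair of samples remains the closest pair, as a crude 'support' value for that branch of the tree. The Manhattan distance and the mutually-closest-pair criterion are simplifying assumptions; real implementations re-run the full hierarchical clustering on each resample.

```python
import numpy as np

def closest_pair(D):
    """Indices of the closest pair of samples in distance matrix D."""
    iu = np.triu_indices(D.shape[0], 1)
    k = np.argmin(D[iu])
    return iu[0][k], iu[1][k]

def branch_support(expr, n_boot=200, seed=0):
    """expr: genes x samples. Returns the closest pair on the full data
    and the fraction of gene-resampled data sets reproducing it."""
    rng = np.random.default_rng(seed)
    def dist(E):
        # Manhattan distance between samples (assumed metric)
        return np.abs(E[:, :, None] - E[:, None, :]).sum(0)
    pair = closest_pair(dist(expr))
    hits = 0
    for _ in range(n_boot):
        idx = rng.integers(0, expr.shape[0], expr.shape[0])
        hits += closest_pair(dist(expr[idx])) == pair
    return pair, hits / n_boot

# toy data: samples 0 and 1 nearly identical, sample 2 very different,
# so the (0, 1) branch should be supported by essentially every resample
expr = np.column_stack([np.arange(50.0),
                        np.arange(50.0) + 0.01,
                        np.arange(50.0)[::-1]])
pair, support = branch_support(expr)
```

A branch supported by, say, 70% or more of the resamples would be considered fairly reliable by the rule of thumb above.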
With any exploratory technique, one should think about what technical variable may underlie the groups discovered this way, before going to the lab to confirm findings. The author finds that clustering most often identifies systematic differences in collection procedures or experimental protocol. These are important but not biologically significant. Even when the difference is biological, it may not be a discovery: most sets of breast cancer data segregate into ER+ and ER- in clustering, which is reassuring but hardly news.

Principal Components and Multi-dimensional scaling

Several other multivariate techniques can help with exploratory analysis. Many authors suggest principal components analysis (PCA) or singular value decomposition to find coherent patterns of genes, or 'metagenes', that discriminate groups. These techniques, long established in the statistical arsenal, rely on the idea that most of the variation in a data set can be explained by a smaller number of transformed variables; each forms linear combinations of the data that capture most of the variation, and in principle they are well suited to this purpose. However, this author believes these approaches are delicate enough that they are not often useful for deep exploratory work; frequently the largest coherent component, such as the first principal component (PC), reflects mostly systematic error in the data. In fact some researchers have seriously suggested normalizing microarray data by removing the first PC. PCA is not terribly robust to outliers, which are common in microarray data. Like cluster analysis, the results of PCA are sensitive to the choice of transform, which is somewhat arbitrary.
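A short illustration of PCA via the singular value decomposition, constructed to show the situation warned about above: a synthetic batch-like offset in half the samples dominates the first PC (the data, offset, and variable names are all illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 200, 8
data = rng.normal(size=(n_genes, n_samples))
data[:, :4] += 3.0          # systematic offset in the first four samples

# center each gene, then take the SVD; squared singular values give the
# variance explained by each component
centered = data - data.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
var_explained = s ** 2 / (s ** 2).sum()

# sample scores on the first principal component: they separate the two
# sample groups created by the offset, not anything biological
pc1_scores = Vt[0]
```

Here the first PC soaks up most of the variance yet reflects only the systematic offset, which is exactly why an analyst should inspect what PC1 corresponds to before interpreting it biologically.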
These multivariate approaches are more useful for exploring relations among samples, and particularly for a diagnostic look at samples before formal statistical tests. Multi-dimensional scaling (MDS) makes the most useful graphical displays. Classical MDS is identical to PCA for most data sets; however, for a fixed dimension, a modern iterative algorithm can place the points in an arrangement that represents the true distances better than the same number of principal components. It is worth obtaining the 'strain' parameter for the MDS fit: it measures the discrepancy between the distances computed by the metric and the distances represented in the picture. When this parameter is much over 15%, the picture can be misleading. Often, if you omit outlying samples or take a major subgroup, the picture becomes a more accurate representation of the computed distances.
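A sketch of classical MDS from a distance matrix, with a discrepancy measure in the spirit of the 'strain' parameter just mentioned; the exact definition used by any particular MDS program may differ.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS: double-center the squared distances and
    take the top eigenvectors as coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                 # double-centering
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:dim]           # largest eigenvalues
    L = np.sqrt(np.clip(w[order], 0, None))
    return V[:, order] * L

def stress(D, coords):
    """Relative discrepancy between the input distances and the distances
    in the low-dimensional picture (one common normalization)."""
    Dhat = np.sqrt(((coords[:, None] - coords[None]) ** 2).sum(-1))
    return np.sqrt(((D - Dhat) ** 2).sum() / (D ** 2).sum())

# points that genuinely live in two dimensions are reproduced with
# essentially zero discrepancy; real expression data will not be
pts = np.array([[0.0, 0], [1, 0], [0, 1], [1, 1], [0.5, 2]])
D = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(-1))
coords = classical_mds(D)
```

By the rule of thumb above, a fit with this discrepancy much over 15% should be viewed with suspicion.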
When the data conform to expectation, the MDS plot shows well-defined groups. Figure ** shows an MDS plot of seven groups in a comparative study; the control group is in black. Three of the groups are quite similar to controls and to each other, while three others are quite distinct from controls and from each other. This experiment showed many genes under very clear regulation.
Multidimension Scale plot of diabetic mouse data
Things aren't always so happy. The following figure ** shows a group of experiments that might have been misinterpreted. Replicate experiments comparing three treatments against three controls were done on two different dates. Results were not consistent between the dates, but the MDS plot shows that we can work with the day 2 data quite confidently, and separately from day 1. The cluster analysis on the left doesn't show this nearly as clearly.
Cluster diagram vs multi-dimensional scaling
Two batches of 3 treated (T) and 3 control (C) samples, done on two different dates; 1T3 represents the 3rd treated sample on day one. We can see that the day 1 chips cluster together and are displayed together by MDS. However, the day 2 samples seem to fall into two distinct clusters, which do not divide neatly along T and C. The MDS plot shows that the C samples on day 2 are in fact quite close, whereas the 3 T's are more disparate but all quite different from the C's.
