Genomics and Bioinformatics Group

Microarray Data Analysis

The Distribution of Microarray Intensities


This section is a more technical discussion about the distribution of signal intensities, and transforms that may be useful.
The first thing to notice is that most genes are expressed at very low levels, and only a few at high copy number. In statistical jargon, the distribution is skewed to the right. Statisticians often handle highly skewed data on a logarithmic scale. The log transform also treats a fold-change down on the same basis as a fold-change up, so it is now commonly applied to microarray data, although other transforms may be useful for different purposes. Sometimes the distribution of intensities appears roughly bell-shaped after a transform. However, depending on how background is handled and how the low-abundance genes are estimated, the distribution of microarray intensities may appear skewed even on a log scale, as in this example; sometimes it appears double-peaked.
Although distributions of this shape are common in microarray data, there is no reason to believe that they reflect relative gene abundances within a sample. The signals from a single microarray are not direct measures of copy number in the sample hybridised to that array; rather, the signal from each gene probe across arrays is proportional to that gene's copy number across arrays, but with a different proportionality constant for each gene. In particular, the rise on the left does not immediately make sense, because we expect most genes to be expressed at very low copy number and fewer genes at higher levels. More precise methods, such as SAGE, regularly find the distribution of gene abundances to be close to an inverse-square power law: i.e. the number of genes with ten SAGE tags in a sample is roughly 1/100th the number of genes with a single SAGE tag, and the number of genes with 20 SAGE tags is roughly ¼ the number with ten. It would be wrong to conclude from the graph of signal intensities that there are a small number of genes with very low abundance and quite a lot that are slightly more abundant.
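The inverse-square arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `relative_gene_count` and the exact exponent of 2 are assumptions made for illustration, since real SAGE data follow the power law only approximately.

```python
def relative_gene_count(tag_count: int, exponent: float = 2.0) -> float:
    """Relative number of genes seen with `tag_count` tags, under an
    assumed power law n(k) proportional to k ** (-exponent)."""
    if tag_count < 1:
        raise ValueError("tag_count must be a positive integer")
    return float(tag_count) ** (-exponent)


# Genes with ten tags relative to genes with a single tag
ratio_10_vs_1 = relative_gene_count(10) / relative_gene_count(1)
# Genes with twenty tags relative to genes with ten tags
ratio_20_vs_10 = relative_gene_count(20) / relative_gene_count(10)

print(ratio_10_vs_1, ratio_20_vs_10)
```

The two ratios correspond to the "1/100th" and "¼" figures quoted in the text.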
The current best explanation for the typical shape of the microarray signal distribution is that the signal for each gene is a combination of the specific hybridization of that gene, plus some non-specific hybridization from all the other similar sequences or partial transcripts in the sample, plus noise: e.g. dust particles, other labelled transcripts binding to streaks of other probe, etc. The amount of non-specific hybridization depends on the gene, but we think that non-specific hybridization follows some bell-shaped distribution (probably not normal). In fact, the genes whose signals lie to the left of the peak are below the typical background noise, which suggests that their signals are unreliable estimates of true gene abundance.
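A small simulation can illustrate how this explanation produces a peaked signal distribution even when true abundances follow a power law. Everything numeric here is an assumption chosen for illustration: the Pareto abundance model, the normal background with mean 100 and SD 20 (the text itself says the real distribution is probably not normal), and the number of genes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 20_000

# Assumed "true" abundances: Pareto with density proportional to x**-2 for x >= 1
abundance = rng.pareto(1.0, size=n_genes) + 1.0

# Assumed non-specific hybridization plus noise: bell-shaped, centred above zero
background = rng.normal(loc=100.0, scale=20.0, size=n_genes)

# Observed signal is the sum; clip as a guard so the logarithm is always defined
signal = np.clip(abundance + background, 1.0, None)
log_signal = np.log2(signal)

# Locate the histogram peak and count the genes that fall to its left
counts, edges = np.histogram(log_signal, bins=50)
peak_index = int(np.argmax(counts))
frac_left_of_peak = float(np.mean(log_signal < edges[peak_index]))

print(f"peak bin starts at log2 signal {edges[peak_index]:.2f}")
print(f"fraction of genes left of the peak: {frac_left_of_peak:.2%}")
```

In this sketch the genes to the left of the peak are those whose background draw happened to be low, not a distinct class of rare transcripts, which is the point made above.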
 
 

Variation and Transforms

Let's distinguish technical variability, the typical differences between repeated measures on the same sample, from individual variation. Technical variation is due to differences in sample preparation, the course of hybridization, and other factors. This is usually what is called 'noise'. On top of that, different (healthy) individuals have consistently different patterns of gene expression. In experiments where groups consisting of several individuals are compared, this variation may also be considered 'noise'.
A common observation in biology is that noise increases with level. So the technical variation in the measured signal of a highly expressed gene (a typical housekeeping gene, say) is larger in absolute terms than that of a gene expressed at low levels (a typical transcription factor). Most statistical procedures assume that all values being compared have comparable noise levels; the p-values from these procedures will be erroneous if variability differs greatly. Statisticians treat this problem with transforms. A common choice is the logarithm; another is a fractional power (e.g. square root or cube root). The logarithm is the most common transform for microarray data, and it also has the attractive feature that a fold-change of any given size appears as a shift of constant amount for all genes. It compensates for the intensity-dependent noise, but in fact over-compensates: after the transform, noise at the lower end is higher than noise at the upper end.
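The over-compensation can be seen in a short simulation. This is a sketch under an assumed noise model (an additive term plus a multiplicative term); the noise sizes and the two "true" levels of 150 and 5000 are arbitrary illustrative values, not estimates from any real chip.

```python
import numpy as np

rng = np.random.default_rng(1)
n_reps = 10_000
sigma_add = 20.0   # assumed SD of the additive noise, in raw intensity units
sigma_mult = 0.1   # assumed SD of the multiplicative noise, on the natural-log scale


def simulate(true_level: float) -> np.ndarray:
    """Repeated measurements of one gene at a given true level."""
    multiplicative = np.exp(rng.normal(0.0, sigma_mult, n_reps))
    additive = rng.normal(0.0, sigma_add, n_reps)
    return true_level * multiplicative + additive


low = simulate(150.0)     # a weakly expressed gene
high = simulate(5000.0)   # a strongly expressed gene

# Raw scale: noise grows with level
raw_sd_low, raw_sd_high = low.std(ddof=1), high.std(ddof=1)
# Log scale: the ordering flips, so noise is now larger at the low end
log_sd_low, log_sd_high = np.log2(low).std(ddof=1), np.log2(high).std(ddof=1)

print(f"raw SD:  low={raw_sd_low:.1f}  high={raw_sd_high:.1f}")
print(f"log2 SD: low={log_sd_low:.3f}  high={log_sd_high:.3f}")
```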
 
Figure. Raw intensities: a scatter plot of the same genes on two chips (left), and standard deviations across chips as a function of the mean over all chips (right).
Figure. Log intensities: a scatter plot of the log intensities of genes on two chips (left), and standard deviations of the log intensities across chips as a function of the log mean over all chips (right).
 
At this point it is worth introducing a common device for displaying the comparison between two samples: the ratio-intensity plot (R-I plot). This is most convenient on a log scale, because down-regulations (ratios lower than one) are then represented symmetrically with up-regulations (ratios higher than one).
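The coordinates of an R-I plot are simple to compute. The sketch below uses base-2 logarithms and takes the average log intensity of the two samples as the horizontal axis; both choices, and the function name `ratio_intensity`, are assumptions made here, since other conventions are also in use.

```python
import numpy as np


def ratio_intensity(sample_a, sample_b):
    """Return (log2 ratio, mean log2 intensity) for paired positive signals."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    log_ratio = np.log2(a) - np.log2(b)
    log_intensity = 0.5 * (np.log2(a) + np.log2(b))
    return log_ratio, log_intensity


# Three hypothetical genes: two-fold up, two-fold down, unchanged
a = np.array([200.0, 100.0, 400.0])
b = np.array([100.0, 200.0, 400.0])
log_ratio, log_intensity = ratio_intensity(a, b)

print(log_ratio)       # two-fold up and two-fold down are equal and opposite
print(log_intensity)
```

The first two genes land at the same intensity, one unit above and one unit below zero on the ratio axis, which is the symmetry the log scale provides.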
 
 
We might ask if there is a simple transform that makes noise comparable at all levels. There are such transforms, but they are not simple. Several research groups presented variance-stabilizing transforms in 2002. Their proposals are based on a model where the noise for each gene has an additive component (perhaps reflecting background), and a multiplicative component (reflecting hybridization fluctuations). The simplest model is:
[Equation: model with additive and multiplicative noise components; image not available]
and this gives rise to a simple transform of this form:
[Equation: variance-stabilizing transform; image not available]
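A commonly cited way of writing an additive-plus-multiplicative error model, and the matching variance-stabilizing transform, is sketched below. The symbols are notation assumed here and may differ from that of the individual groups: $y$ is the measured intensity, $\mu$ the true expression level, $\alpha$ the mean background, $\eta$ the multiplicative error, $\varepsilon$ the additive error, and $c$ a positive constant set by the relative sizes of the additive and multiplicative noise.

```latex
% Assumed error model: background + multiplicative noise + additive noise
y = \alpha + \mu\, e^{\eta} + \varepsilon,
\qquad \eta \sim N(0, \sigma_\eta^2),
\quad \varepsilon \sim N(0, \sigma_\varepsilon^2)

% Generalized-log transform associated with this model
h(y) = \ln\!\left( (y - \alpha) + \sqrt{(y - \alpha)^2 + c} \right)
```

For large $y$ this transform behaves like $\ln(2(y - \alpha))$, i.e. like an ordinary logarithm, while near the background level it is close to linear, so it avoids the blow-up of noise that the plain logarithm shows at low intensities.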
Although in principle one should be able to estimate the parameters empirically, in practice you often get better results from other choices, and the groups have published calibration algorithms. One practical advantage of displaying data on a variance-stabilized scale is that straight lines on a scatterplot mark off statistically significant differences.
A simpler approach is to try both a logarithm transform and a cube-root transform; often one or the other will be almost as good as the variance-stabilizing transform.
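One way to compare the candidates is to simulate replicate measurements at several expression levels and see how much the standard deviation still varies with level after each transform. The sketch below assumes an additive-plus-multiplicative noise model with background already subtracted. The noise sizes and levels are illustrative assumptions, and the generalized-log constant `c` is computed from the known simulation parameters, whereas with real data it would have to be estimated or calibrated.

```python
import numpy as np

rng = np.random.default_rng(2)
n_reps = 5_000
sigma_add, sigma_mult = 20.0, 0.1   # assumed additive and multiplicative noise SDs
levels = np.array([150.0, 500.0, 1500.0, 5000.0, 15000.0])

# Variance of the multiplicative factor exp(eta), and the matching glog constant
s2 = np.exp(sigma_mult**2) * (np.exp(sigma_mult**2) - 1.0)
c = sigma_add**2 / s2


def glog(y):
    """Generalized logarithm: log-like at high y, near-linear close to zero."""
    return np.log(y + np.sqrt(y**2 + c))


transforms = {"log": np.log, "cube root": np.cbrt, "glog": glog}


def sd_spread(transform) -> float:
    """Ratio of largest to smallest SD across levels (1.0 = fully stabilized)."""
    sds = []
    for mu in levels:
        y = mu * np.exp(rng.normal(0.0, sigma_mult, n_reps)) \
            + rng.normal(0.0, sigma_add, n_reps)
        sds.append(transform(y).std(ddof=1))
    return max(sds) / min(sds)


spread = {name: sd_spread(f) for name, f in transforms.items()}
for name, value in spread.items():
    print(f"{name:>9}: max SD / min SD = {value:.2f}")
```

Which of the two simpler transforms comes closer to the variance-stabilizing one depends on the noise parameters and on the range of intensities covered, so it is worth running the comparison on the data at hand.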
Figure. The effect of a cube-root transform on the chip vs chip plot (left) and the SD vs mean level plot (right)
