| This section discusses in more technical detail the distribution of signal
intensities and the transforms that may be useful for analysing them. |
 |
| The first thing
to notice is that most genes are expressed at very low levels; few genes are
expressed at high copy number. In statistical jargon, the
distribution is skewed to the right. Statisticians often analyse highly
skewed data on a logarithmic scale. This transform also treats a fold-change
down on the same basis as a fold-change up, so it is now common to apply it
to microarray data. However, other transforms may be useful for
different purposes. Sometimes the distribution of intensities appears roughly
bell-shaped after the log transform; however, depending on how background is
handled and how the low-abundance genes are estimated, the distribution of
intensities from the microarray may appear skewed even on a log scale, as does
this example; sometimes the distribution appears double-peaked. |
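The symmetry of fold-changes on a log scale can be checked with a few lines of Python. This is a minimal sketch: the function name `log2_ratio` and the intensity values are made up for illustration, and base 2 is an assumed (though common) choice of logarithm.

```python
import math

def log2_ratio(treated, control):
    """Return the base-2 logarithm of the fold-change treated/control."""
    return math.log2(treated / control)

# Hypothetical intensities: a two-fold increase and a two-fold decrease
# relative to the same control level of 100.
up = log2_ratio(200.0, 100.0)
down = log2_ratio(50.0, 100.0)

# On the raw ratio scale the two changes are 2.0 and 0.5, which are not
# symmetric about 1; on the log scale they are equal and opposite.
print(up, down)
```

On the raw scale, down-regulation is squeezed into the interval between 0 and 1 while up-regulation is unbounded; the log transform removes that asymmetry.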
| Although distributions
of this shape are common in microarray data, there is no reason to believe that
they reflect relative gene abundances within a sample. The signals from a
single microarray are not direct measures of copy number in the sample hybridised
to that array; rather the signal from each gene probe across arrays is
proportional to the copy number across arrays, but with a different proportionality
constant for each gene. In particular the rise on the left doesn't immediately
make sense, because we expect most genes to be expressed at very low copy
number, and fewer genes expressed at higher levels. More precise methods, such
as SAGE, regularly find the distribution of gene abundances to be close to an
inverse square power law: i.e. the number of genes with ten SAGE tags in a
sample is roughly 1/100th the number of genes with a single SAGE tag, and
the number of genes with 20 SAGE tags is roughly ¼ the number with ten. It
would be wrong to conclude from the graph of signal intensities that there are
a small number of genes with a very low abundance, and quite a lot that are
slightly more abundant. |
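The arithmetic behind the inverse square law can be sketched as follows. The function `relative_gene_count` is a hypothetical helper, and the exponent of exactly 2 is an idealisation of the "close to" inverse square behaviour described above.

```python
def relative_gene_count(tags, exponent=2.0):
    """Number of genes observed with `tags` tags, relative to the number
    observed with a single tag, assuming count is proportional to
    tags ** (-exponent)."""
    return tags ** (-exponent)

# Genes with ten tags compared with genes with one tag.
ratio_10_vs_1 = relative_gene_count(10)

# Genes with twenty tags compared with genes with ten tags.
ratio_20_vs_10 = relative_gene_count(20) / relative_gene_count(10)

print(ratio_10_vs_1, ratio_20_vs_10)
```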
| The current
best explanation for the typical shape of the microarray signal distribution
is that the signal for each gene is a combination of the hybridization of
that gene, plus some non-specific hybridization from all the other similar
sequences or partial transcripts in the sample, plus noise: e.g. dust
particles, other labelled transcripts binding to streaks of other probe, etc.
The amount of non-specific hybridization depends on the gene, but we think that
non-specific hybridization follows some bell-shaped
distribution (probably not normal). In fact, the genes whose signals lie to the
left of the peak are below the typical background noise, which suggests that
their signals are unreliable estimates of true gene abundance. |
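A small simulation can illustrate how background could produce a peaked distribution even when true abundances follow a power law. Everything in this sketch is an assumption made for illustration: the Pareto abundances, the log-normal background centred near 50 units, and the helper names `simulate_log_signals` and `histogram`.

```python
import math
import random
from collections import Counter

def simulate_log_signals(n_genes=5000, seed=0):
    """Simulate log2 signals as specific hybridization plus background.

    All distributions and parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    signals = []
    for _ in range(n_genes):
        # True abundance: Pareto with shape 1 (density proportional to
        # 1/x**2), so most genes are rare and a few are very abundant.
        specific = rng.paretovariate(1.0)
        # Non-specific hybridization: bell-shaped on the log scale,
        # centred near 50 units (an assumed value).
        background = rng.lognormvariate(math.log(50.0), 0.5)
        signals.append(math.log2(specific + background))
    return signals

def histogram(values, bin_width=0.5):
    """Count values per bin, keyed by the left edge of each bin."""
    return Counter(math.floor(v / bin_width) * bin_width for v in values)

log_signals = simulate_log_signals()
counts = histogram(log_signals)

# Left edge of the most populated bin, and the fraction of genes whose
# signal falls to the left of that bin.
peak_bin = max(counts, key=counts.get)
left_of_peak = sum(1 for v in log_signals if v < peak_bin) / len(log_signals)
print(peak_bin, left_of_peak)
```

Although the simulated abundances fall off steadily from the lowest value, the simulated signals rise to a peak set by the background, so a noticeable fraction of genes sits to the left of the peak.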
| |
|
| |
Variation and Transforms |
| Let's
distinguish technical variability, the typical differences between repeated
measures on the same sample, from individual variation. Technical variation is
due to differences in sample preparation, the course of hybridization, and
other factors. This is usually what is called 'noise'. On top of that, different
(healthy) individuals have consistently different patterns of gene expression.
In experiments where groups consisting of several individuals are compared,
this variation may also be considered 'noise'. |
| A common
observation in biology is that noise increases with signal level. So the
technical variation in a measure of a highly expressed housekeeping gene is
higher than that of a low-abundance transcription factor. Most statistical
procedures assume that all values being compared have comparable noise levels;
the p-values from these procedures will be erroneous if there is great
discrepancy in variability. Statisticians treat
this problem with transforms, and a common choice is the logarithm transform;
another common choice is a fractional power (e.g. square root or cube root). The
most common transform in microarray analysis is the logarithm transform, which
also has the attractive feature that fold-changes of any given size appear as
shifts of constant amount for all genes. It also compensates for the
intensity-dependent noise, but actually over-compensates: after the transform,
noise at the lower end is higher than noise at the upper end. |
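The over-compensation of the log transform can be reproduced in a short simulation. This is a sketch under an assumed noise model (a multiplicative log-normal error plus an additive normal error); the parameter values and the helper name `simulate_replicates` are hypothetical.

```python
import math
import random
import statistics

def simulate_replicates(true_level, n=2000, sd_mult=0.05, sd_add=10.0, seed=1):
    """Simulate repeated measurements of one gene.

    Assumed noise model (not taken from the text): a multiplicative
    log-normal error plus an additive normal error.
    """
    rng = random.Random(seed)
    return [
        true_level * math.exp(rng.gauss(0.0, sd_mult)) + rng.gauss(0.0, sd_add)
        for _ in range(n)
    ]

low = simulate_replicates(100.0)      # weakly expressed gene
high = simulate_replicates(10000.0)   # strongly expressed gene

# Standard deviations on the raw scale and on the log2 scale.
sd_raw_low = statistics.stdev(low)
sd_raw_high = statistics.stdev(high)
sd_log_low = statistics.stdev([math.log2(v) for v in low])
sd_log_high = statistics.stdev([math.log2(v) for v in high])

print(sd_raw_low, sd_raw_high)   # raw noise grows with level
print(sd_log_low, sd_log_high)   # log-scale noise is larger at the low end
```

On the raw scale the strongly expressed gene is the noisier one; on the log scale the ordering reverses, because the additive error is large relative to a small signal.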
 |
| |
| The plot on the left shows a scatter plot of the same genes on two chips.
The plot on the right shows standard deviations across chips as a function of the mean over all
chips. |
| The plot on
the left shows a scatter plot of the log intensities of genes on two chips. The
plot on the right shows standard deviations of the log intensities across chips
as a function of the log mean over all chips. |
| |
| At this point it is worth introducing a common device for displaying
the comparison between two samples: the ratio-intensity plot (R-I plot). This is most
convenient on a log scale because down-regulations (ratios lower than one) are
represented symmetrically with up-regulations (ratios higher than one). |
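The coordinates of an R-I plot can be computed as in the sketch below. The function name `ratio_intensity` and the example intensities are hypothetical, and using the mean of the two log intensities as the intensity axis is an assumed convention; other choices are possible.

```python
import math

def ratio_intensity(chip_a, chip_b):
    """Return (log2 ratio, mean log2 intensity) for each gene on two chips."""
    points = []
    for a, b in zip(chip_a, chip_b):
        log_ratio = math.log2(a) - math.log2(b)               # vertical axis
        log_intensity = 0.5 * (math.log2(a) + math.log2(b))   # horizontal axis
        points.append((log_ratio, log_intensity))
    return points

# Three hypothetical genes: two-fold up, two-fold down, and unchanged.
points = ratio_intensity([200.0, 50.0, 100.0], [100.0, 100.0, 100.0])
for log_ratio, log_intensity in points:
    print(round(log_ratio, 3), round(log_intensity, 3))
```

Genes that do not differ between the two samples fall on the horizontal line at zero, and two-fold changes in either direction lie at equal distances above and below it.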
| |
 |
| |
| We might
ask if there is a simple transform that makes noise comparable at all levels.
There are such transforms, but they are not simple. Several research groups
presented variance-stabilizing transforms in 2002. Their proposals are based on
a model where the noise for each gene has an additive component (perhaps
reflecting background), and a multiplicative component (reflecting
hybridization fluctuations). The simplest model is: |
| |
| and this gives rise to a simple transform of this form: |
 |
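A common way to write an additive-plus-multiplicative error model and its variance-stabilizing transform is sketched below. This is offered as an assumed illustration: the symbols α, μ, η, ε and c are not taken from the text above, and the published proposals differ in detail.

```latex
% Assumed two-component error model for the measured intensity y of a gene
% with true expression level \mu and mean background \alpha:
y = \alpha + \mu\, e^{\eta} + \varepsilon,
\qquad \eta \sim N(0, \sigma_\eta^2),
\quad \varepsilon \sim N(0, \sigma_\varepsilon^2)

% Variance implied by the model (\eta and \varepsilon independent):
\operatorname{Var}(y) = \mu^2 S_\eta^2 + \sigma_\varepsilon^2,
\qquad S_\eta^2 = e^{\sigma_\eta^2}\left(e^{\sigma_\eta^2} - 1\right)

% A transform with approximately constant variance under this model
% (often called the generalized logarithm):
f(y) = \ln\!\left( (y - \alpha) + \sqrt{(y - \alpha)^2 + c^2} \right),
\qquad c = \sigma_\varepsilon / S_\eta
```

For intensities well above background, f(y) behaves like ln(2(y − α)), i.e. like an ordinary logarithm; near background it is approximately linear in y, which avoids the inflation of noise that the plain logarithm produces at the low end.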
| Although in
principle one should be able to estimate the parameters empirically, in
practice you often get better results from other choices, and the groups have
published calibration algorithms. One practical advantage of displaying data on
this scale is that straight lines on a scatterplot mark statistically
significant differences. |
| A simpler approach is to try both a logarithm transform, and a cube-root
transform; often one or the other will be almost as good as the variance-stabilizing
transform. |
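One way to compare candidate transforms is to check how much the standard deviation still varies across expression levels after each one. The sketch below does this on simulated data; the noise model, its parameters, the generalized-log form ln(y + sqrt(y² + c²)), and all helper names are assumptions made for illustration.

```python
import math
import random
import statistics

SD_MULT = 0.05   # assumed multiplicative noise (log scale)
SD_ADD = 10.0    # assumed additive noise (raw units)

def simulate(true_level, rng, n=2000):
    """Replicate measurements under the assumed two-component noise model."""
    return [
        true_level * math.exp(rng.gauss(0.0, SD_MULT)) + rng.gauss(0.0, SD_ADD)
        for _ in range(n)
    ]

def glog(y, c=SD_ADD / SD_MULT):
    """Generalized log, ln(y + sqrt(y^2 + c^2)); c is an assumed constant."""
    return math.log(y + math.sqrt(y * y + c * c))

TRANSFORMS = {
    "log": math.log,
    "cube_root": lambda y: y ** (1.0 / 3.0),
    "glog": glog,
}

def sd_spread(transform, samples_by_level):
    """Ratio of the largest to the smallest SD across expression levels;
    a value of 1.0 would mean the noise is the same at every level."""
    sds = [
        statistics.stdev([transform(v) for v in samples])
        for samples in samples_by_level
    ]
    return max(sds) / min(sds)

rng = random.Random(2)
samples_by_level = [simulate(level, rng) for level in (100.0, 1000.0, 10000.0)]
spread = {name: sd_spread(fn, samples_by_level) for name, fn in TRANSFORMS.items()}
for name, value in spread.items():
    print(name, round(value, 2))
```

Which of the two simple transforms comes closest to the variance-stabilizing one depends on the balance of additive to multiplicative noise, so it is worth running this kind of check on real replicate data before settling on a scale.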
| Figure. The effect of a cube-root transform on the chip vs chip plot (left)
and the SD vs mean level plot (right) |