Image Quantification and Background |
The image scanning algorithm lays down a grid over the image, to identify
squares in which the probes should be, and then selects the brightest subsquare (usually 4x4,
4x5, or 5x5 pixels) as the probe. The scanning software reports the average
pixel intensity of this subsquare, and its standard deviation, which is usually
5% – 20% of the mean. |
MAS 5.0 computes local background in each of 16 squares, and then subtracts
a weighted combination of these background estimates from each probe intensity |
|
Description of Affymetrix Probe Sets |
The strength of the Affymetrix system is that multiple distinct oligonucleotide
probes on each chip represent every gene. However the signals from the different probes for the
same gene aren't the same; signals from individual probes for the same gene may differ, on the
same chip, by as much as two orders of magnitude (a factor of 100). See Figures 1 and 2. The
sequences are different, and the probes have different hybridization constants for their target:
the most important factor in signal intensity is C:G content. How do we combine signals from the
many probes for a gene, into a single estimate of the abundance of that gene? |
|
Figure 1. Images of probes from human GAPDH probe set extracted from an Affymetrix U95A
chip image. PM probes in top row; corresponding MM probes on bottom. Two probes
are bright, three others are moderately bright, the rest are dim. |
|
Figure 1B. Typical numbers from a GAPDH probe set. |
|
Figure 2. Graphical representation of intensities from a GAPDH probe set. PM values
appear in blue; MM in green. Vertical axis height represents 30,000. |
|
Normalization of Affymetrix Chips |
Why Normalize? |
Biologists have long experience coping with systematic variation between experimental
conditions (technical variation) that is unrelated to the biological differences they seek.
Normalization is the attempt to compensate for systematic technical differences between chips,
to see more clearly the systematic biological differences between samples. Differences in treatment
of two samples, especially in labelling and in hybridization, bias the relative measures on any
two chips. |
The performance of expression arrays can vary in more ways than measures
such as rt-PCR. Normalization methods that have worked well for these types of measures
do not perform as well for microarray data. |
Affymetrix introduced a new approach
for their 133 series chips, using a set of 100 'housekeeping genes': the chips
are re-scaled so the average values of these housekeeping genes are equal
across all chips. This is much better than using a single housekeeping gene,
and probably adequate for about 80% of chips in practice. |
Most approaches to normalizing expression levels assume that the overall
distribution of RNA numbers doesn't change much between samples, and that most individual
genes change very little across the conditions. This seems reasonable for most laboratory
treatments, although treatments affecting transcription apparatus have large systemic
effects, and malignant tumours often have dramatically different expression profiles. If
most genes are unchanged, then the mean transcript levels should be the same for each condition.
An even stronger version of this idea is that the distributions of gene abundances must
be similar. |
Statisticians use the term 'bias' to describe systematic errors, which
affect a large number of genes. eep in mind that normalization, like any form of data
'fiddling' adds noise (random error) to the expression measures. You never really identify
the true source or nature of a systemic bias; rather you identify some feature,
which correlates with the systematic error. When you 'correct' for that
feature, you are adding some error to those samples where the feature you have
observed doesn't correspond well with the true underlying source of bias.
Statisticians try to balance bias and noise, and their rule of thumb is that
it's better to under-correct for systemic biases than to compensate fully. |
|
Normalization by Scaling and its Limitations |
The simplest approach to normalizing Affymetrix data is to re-scale each chip
in an experiment to equalize the average (or total) signal intensity across all chips. The
reasoning behind this is that there should be with equal weights of RNA for all the samples;
if the sizes of the RNA molecules are comparable, the number of RNA molecules should also be
the roughly the same in each sample. Consequently, nearly the same number of labeled molecules
from each sample should hybridise to the arrays and, if all other conditions were equal, the
total hybridisation intensities summed over all elements in the arrays should be the same for
each sample. Of course, other conditions aren't equal; hence normalization restores this equality.
In practice, for a series of chips, define normalization constants C1 , C
2 ,…, by: |
,
and so on, |
where the numbers figene are the fluorescent
intensities measured for each probe on chip i. Select a common total intensity
K (eg. the average of the Ci's). Then to normalize all the
chips to the common total intensity K, divide all fluorescent intensity
readings from chip i by Ci., and multiply by K. Variants of this
approach, scaling all probes on a chip by the chip's trimmed mean intensity, or by median
intensity, are widely available in commercial software. |
To do better, we examine in detail the relationships among replicate chips
(chips hybridized to the same sample). Figure 1 shows a scatter plot of probes from one pair of
chips; there is clearly a non-linear relation among probes. Figure 2 shows plots of probe
distributions from a number of replicate chips on a log scale; these distributions have very
different shapes; on a log scale, applying a scaling transform to a chip, shifts its distribution
curve to the right or left, but doesn't change its shape. Figure 5A shows Ratio-Intensity plots
of pairs of Affymetrix chips hybridized with replicate samples (so all log-ratios should be 0);
a scaling transform will shift the R-I plots up or down, without changing their configuration.
For perhaps 80% of chips, the relationship between probe distributions is close enough to linear
that a scaling transform will bring the chips into line. For the other 20% of cases something
different is needed. |
|
Figure 1. Plot of probe signals from two Affymetrix chips hybridized with identical mRNA samples.
The black straight line represents equality, while the blue curve is a spline fit through the
scatter plot. |
|
Figure 2. Density of PM probe signals on 23 different chips from GeneLogic spike-in
experiment (Courtesy of Terry Speed) |
|
Normalization by MAS 5.0 |
MAS5 does slightly better than scaling using linear regression. The procedure
is to construct a plot of each chip's probes against the corresponding probes on the baseline
chip; eliminate the highest 1% of probes (and for symmetry the lowest 1%), and fit a regression
line to the middle 98% of probes (ie. estimate slope and intercept: two parameters). Transform
the values in each probe, by subtracting the intercept, and dividing by the slope, so that the
regression line becomes the identity ( y = x ) line. |
A better two-parameter approach is to both re-scale and shift the origin, in
order fit both the mean and the standard deviation of the probe distribution to the common mean
and standard deviation of all data. This seems to do somewhat better than regression, in reducing
noise (variation among replicate measures on the same sample), at the cost of (sometimes)
introducing a few negative values. |
|
Invariant Set Normalization |
Li and Wong introduced a method, where a large number of genes are
selected ad-hoc as references, rather than using a standard set of 'housekeeping genes'.
Their method assumes that there is a subset of unchanged genes, between any two samples.
Their method selects a subset of genes g1,…, gM,
whose probes: p1, …, pK, (K ~ 10000), occur in the same rank
order on each chip such that p1 < p2 < …< pK
in both chips (an invariant set); then fits a non-parametric curve (running median) through
the points { (p1(1), p1(2)),…,
(pK(1), pK(2)) }. Ideally one would like a common
invariant set of reference genes across all chips, but in practice, only a very
few probes are in common rank order, or even close to that, across all chips. |
|
Quantile Normalization |
Terry Speed's group introduced a non-parametric procedure normalizing to a
synthetic chip. Their method assumes that the distribution of gene abundances is nearly
the same in all samples. For convenience they take the pooled distribution of probes on all
chips. Then to normalize each chip they compute for each value, the quantile of that value in
the distribution of probe intensities; they then transform the original value to that quantile's
value on the reference chip. In a formula, the transform is |
xnorm = Fi-1(Fref(x))
, |
where Fi is the distribution function of chip i,
and Fref is the distribution function of the reference chip. |
|
Figure 3. Schematic representation of quantile normalization: the value x, which is
the α-th quantile of all probes on chip 1, is mapped to the value y, which is the
α quantile of the reference distribution F2. |
If Fi and Fref are fairly similar in shape, then in
practice this transform is not too different from a straight line, which is what a scaling
transform looks like; see Figure 4. However the transform is strong enough to cope with the
non-linear ratio-intensity relationships revealed in figure 5A; see figure 5B
after quantile transformation. |
|
Figure 4. The effects of quantile normalization
on raw probe values in three chips. Raw values are on x-axis, normalized values
on y-axis. Often this transform looks very much like a scaling transform
(nearly linear), but sometimes it is quite non-linear. |
Figure A. Ratio Intensity Plot of all probes for four pairs of chips from GeneLogic
spike-in experiment |
Figure B. As in A, after normalization by matching quantiles. Both figures courtesy of Terry
Speed |
This form of normalisation also reduces noise among replicate measures of the
same samples, compared to normalization by scaling, as show below in figure 6. |
|
Figure 6. Each dot represents one probe on an Affymetrix U95A chip. On the y-axis is the
ratio of variance across a set of replicates after quantile normalisation, divided by the variance
of the scale-normalized values. On the x-axis are the mean levels. Both axes on
log scale. |
The main drawback of this approach to normalization is the strong assumption
that the distributions of probe intensities are identical (even if individual probes differ in
their positions in the distribution). This is true for low abundance genes, and to a fairly
good approximation for genes of moderate abundance, but certainly not true for the few high-abundance
genes, whose typical levels vary noticeably from sample to sample. |
|
Combining Probe-Level Signals into Gene Abundance Estimates |
There has been considerable discussion over the appropriate algorithm for
constructing single expression estimates based on multiple-probe hybridization data. To date,
over a dozen different methods have been published, which aim to synthesize the different readings
from the various probes for a gene, into a single estimate of transcript abundance. Affymetrix
recently sponsored a conference on the topic <LINK>. |
|
Estimation by Affymetrix MicroArray Suite |
Affymetrix has upgraded their MicroArray Suite (MAS) software several times over
the short history of their product. MAS 4.0 was the standard until January 2002 and is still cited
in published papers. MAS 4 calculates a weighted average of the probe-pair differences (PM – MM)
for each probe pair representing a gene. MAS 5.0 improves in two important ways. First the
intensities are transformed to a logarithmic scale before the average is taken; this equalizes the
contribution of different probes. Secondly an estimate of background based on MM replaces MM itself
in the difference PM-MM; this estimate is itself a weighted average of log probe pair differences:
log(PMj/MMj). The "Statistical Algorithm Description
Document" from Affymetrix, has more details. |
The idea of averaging different probe intensities for the same gene is seems
quite wrong. The individual probes in a probe set have very different hybridization kinetics.
Taking an average of their intensities, is like averaging the readings from scans taken at very
different settings. A good algorithm should compare information about probe characteristics, based
on the performance of each probe across chips, and use this to make a better estimate. These
principles are the basis of the multi-chip models. Affymetrix has seen the evidence, and they are
planning to release their own multi-chip model in 2004. MAS 4.0 and MAS 5.0 will soon be of
historic interest only. |
Multi-Chip Models |
A chemical motivation for multi-chip models comes from reasoning that the amount
of signal from one probe in a gene's probe set, should depend both on the amount of that gene in
the sample, and on the specific affinity of the probe for that gene's mRNA. The statistical
motivation for multi-chip models is that the signals from individual probes move in parallel across
a set of chips; the signals have roughly the same pattern across the different samples, as shown
in Figure 3. The animations of probe sets in dChip show
this quite compellingly. |
|
Figure 3. Probe signals from a spike-in experiment. The concentrations are plotted along
the horizontal axis (log scale), and the probe signals are plotted on the
vertical axis (log scale). Each probe is represented by one color. The
different probe signals change in parallel. Image courtesy of Terry Speed |
|
A good estimation strategy estimates both the characteristics of the probe
(affinity for the target or avidity), and gene abundance. The simplest strategy is
the statistical linear two-factor model: that model assumes that the errors in each data point
have similar variances, and the two factors combine in a simple way. If the signal from each probe
is proportional to both probe affinity, and gene abundance, then it must be proportional to the
product of these factors multiplied together. Suppose for one target gene, the chip has a set of
probes p1,…,pk; each probe pj binds to the
target with affinity fk. Suppose in each sample i the gene occurs in
amount ai. Then the intensity of probe j on chip i should be
proportional to fk ai. See figure 4. |
|
Figure 4. Ideal linear model relationship among intensity (height of green bars),
abundance of transcript (ai), and probe affinity (j). |
In practice, the discrepancies between real data and the ideal model are
frequent outliers, far beyond the usual random fluctuations in signal intensities. These
may be due to scratches, or uneven heating, or other artefacts. Typically up to15% of
probes in an acceptable Affymetrix chip are outliers. Most standard methods to
fit data flounder badly on data with this many outliers. The robust approach is
to try to identify the outliers, and exclude or down-weight them. |
|
Constant Variance – the Li and Wong Model and Critique |
Li and Wong originally suggested the model PMij – MM
ij = fk ai + εij
, by analogy with MAS 4.0. They have found better fits with the model PMij
= fk ai + εij, (PM-only).
Li-Wong assumes that the noise in all the probe measures is roughly same size.
In practice all biological measures exhibit intensity-dependent noise. The effect of their
assumption is that probes with smaller variation are ignored, even though this variation may be
measuring real differences. Fortunately the bright probes are often the most specific, and it
does little harm to ignore the majority of probes, if the bright probes are good. They have tuned
their fitting procedure to try to reduce the emphasis on the very bright probes, but this has resulted
in often throwing out a good bright probe probe as an outlier. |
|
Figure 5. A probe set with values (represented by red lines) fitted to actual PM-MM values
(in blue); from dChip. |
Proportional Variance – RMA |
This is largely the work of Terry Speed's group at Berkeley, especially Ben
Bolstad, and Rafael Irizarry. They work only with PM values, and ignore MM entirely.
They take a log transform of equation () and find |
|
 |
|
With errors proportional to intensity in the original scale, the errors on the
log scale have constant variance. After background subtraction and normalization they fit: |
|
 |
|
where nlog is their terminology for 'normalize and then take logarithm'. They
fit this model by iteratively re-weighted least squares, or by median polish. Code is available in
the affy package on BioConductor, together with quantile normalization. |
At the moment RMA appears to be the best method available. Figure 6 compares
the performance of several algorithms on replicate arrays; the smaller the variability the better
the algorithm. Four strata of genes are shown: from lowest to highest expression. MAS5 apparently
does a decent job on high abundance genes, but the multi-chip models do better on low-abundance
genes, which include many genes of interest, such as transcription factors, and
signalling proteins. |
|
Figure 6. Comparison of MAS5 (green), dChip (black), RMA (blue), RMA (red): The genes have
been divided into quarters based on average expression. Each boxplot represents
the standard deviation of genes in one fraction. Note that the multi-chip
models do almost ten times better than MAS5 on the low-abundance genes; this
category includes most transcription factors and signalling proteins. |
|
Model-based Quality Control of Affymetrix Chips |
One of the particular values of a multi-probe system is that all probes
effectively act like positive controls. Since the Affymetrix probes have such different response
characteristics, you don?t want to reject large or small probes, but with a good multi-chip model,
the hybridization problems show up as outliers from fitted multi-probe model. The outliers are
surprisingly good at showing problems in distinct areas of some chips. For example, the following
figure 7 was made from dChip, and shows outliers on one chip. These outliers mostly
occur in two rings. |
|
Figure 7. Outliers are show in pink |
Software Available |
Li and Wong's method is available through their program dChip, at www.dchip.org. Academic licenses are free. |
The RMA method is available as part of the affy package in the Bioconductor
tools suite: see www.bioconductor.org.
A commercial software vendor, Iobion, has incorporated RMA into their GeneTrafic
product. |