Image Quantification and Background 
The image scanning algorithm lays down a grid over the image, to identify
squares in which the probes should be, and then selects the brightest subsquare (usually 4x4,
4x5, or 5x5 pixels) as the probe. The scanning software reports the average
pixel intensity of this subsquare, and its standard deviation, which is usually
5% – 20% of the mean. 
MAS 5.0 computes local background in each of 16 squares, and then subtracts
a weighted combination of these background estimates from each probe intensity 

Description of Affymetrix Probe Sets 
The strength of the Affymetrix system is that multiple distinct oligonucleotide
probes on each chip represent every gene. However the signals from the different probes for the
same gene aren't the same; signals from individual probes for the same gene may differ, on the
same chip, by as much as two orders of magnitude (a factor of 100). See Figures 1 and 2. The
sequences are different, and the probes have different hybridization constants for their target:
the most important factor in signal intensity is C:G content. How do we combine signals from the
many probes for a gene, into a single estimate of the abundance of that gene? 

Figure 1. Images of probes from human GAPDH probe set extracted from an Affymetrix U95A
chip image. PM probes in top row; corresponding MM probes on bottom. Two probes
are bright, three others are moderately bright, the rest are dim. 

Figure 1B. Typical numbers from a GAPDH probe set. 

Figure 2. Graphical representation of intensities from a GAPDH probe set. PM values
appear in blue; MM in green. Vertical axis height represents 30,000. 

Normalization of Affymetrix Chips 
Why Normalize? 
Biologists have long experience coping with systematic variation between experimental
conditions (technical variation) that is unrelated to the biological differences they seek.
Normalization is the attempt to compensate for systematic technical differences between chips,
to see more clearly the systematic biological differences between samples. Differences in treatment
of two samples, especially in labelling and in hybridization, bias the relative measures on any
two chips. 
The performance of expression arrays can vary in more ways than measures
such as rtPCR. Normalization methods that have worked well for these types of measures
do not perform as well for microarray data. 
Affymetrix introduced a new approach
for their 133 series chips, using a set of 100 'housekeeping genes': the chips
are rescaled so the average values of these housekeeping genes are equal
across all chips. This is much better than using a single housekeeping gene,
and probably adequate for about 80% of chips in practice. 
Most approaches to normalizing expression levels assume that the overall
distribution of RNA numbers doesn't change much between samples, and that most individual
genes change very little across the conditions. This seems reasonable for most laboratory
treatments, although treatments affecting transcription apparatus have large systemic
effects, and malignant tumours often have dramatically different expression profiles. If
most genes are unchanged, then the mean transcript levels should be the same for each condition.
An even stronger version of this idea is that the distributions of gene abundances must
be similar. 
Statisticians use the term 'bias' to describe systematic errors, which
affect a large number of genes. eep in mind that normalization, like any form of data
'fiddling' adds noise (random error) to the expression measures. You never really identify
the true source or nature of a systemic bias; rather you identify some feature,
which correlates with the systematic error. When you 'correct' for that
feature, you are adding some error to those samples where the feature you have
observed doesn't correspond well with the true underlying source of bias.
Statisticians try to balance bias and noise, and their rule of thumb is that
it's better to undercorrect for systemic biases than to compensate fully. 

Normalization by Scaling and its Limitations 
The simplest approach to normalizing Affymetrix data is to rescale each chip
in an experiment to equalize the average (or total) signal intensity across all chips. The
reasoning behind this is that there should be with equal weights of RNA for all the samples;
if the sizes of the RNA molecules are comparable, the number of RNA molecules should also be
the roughly the same in each sample. Consequently, nearly the same number of labeled molecules
from each sample should hybridise to the arrays and, if all other conditions were equal, the
total hybridisation intensities summed over all elements in the arrays should be the same for
each sample. Of course, other conditions aren't equal; hence normalization restores this equality.
In practice, for a series of chips, define normalization constants C_{1} , C
_{2} ,…, by: 
,
and so on, 
where the numbers f_{i}^{gene} are the fluorescent
intensities measured for each probe on chip i. Select a common total intensity
K (eg. the average of the C_{i}'s). Then to normalize all the
chips to the common total intensity K, divide all fluorescent intensity
readings from chip i by C_{i}., and multiply by K. Variants of this
approach, scaling all probes on a chip by the chip's trimmed mean intensity, or by median
intensity, are widely available in commercial software. 
To do better, we examine in detail the relationships among replicate chips
(chips hybridized to the same sample). Figure 1 shows a scatter plot of probes from one pair of
chips; there is clearly a nonlinear relation among probes. Figure 2 shows plots of probe
distributions from a number of replicate chips on a log scale; these distributions have very
different shapes; on a log scale, applying a scaling transform to a chip, shifts its distribution
curve to the right or left, but doesn't change its shape. Figure 5A shows RatioIntensity plots
of pairs of Affymetrix chips hybridized with replicate samples (so all logratios should be 0);
a scaling transform will shift the RI plots up or down, without changing their configuration.
For perhaps 80% of chips, the relationship between probe distributions is close enough to linear
that a scaling transform will bring the chips into line. For the other 20% of cases something
different is needed. 

Figure 1. Plot of probe signals from two Affymetrix chips hybridized with identical mRNA samples.
The black straight line represents equality, while the blue curve is a spline fit through the
scatter plot. 

Figure 2. Density of PM probe signals on 23 different chips from GeneLogic spikein
experiment (Courtesy of Terry Speed) 

Normalization by MAS 5.0 
MAS5 does slightly better than scaling using linear regression. The procedure
is to construct a plot of each chip's probes against the corresponding probes on the baseline
chip; eliminate the highest 1% of probes (and for symmetry the lowest 1%), and fit a regression
line to the middle 98% of probes (ie. estimate slope and intercept: two parameters). Transform
the values in each probe, by subtracting the intercept, and dividing by the slope, so that the
regression line becomes the identity ( y = x ) line. 
A better twoparameter approach is to both rescale and shift the origin, in
order fit both the mean and the standard deviation of the probe distribution to the common mean
and standard deviation of all data. This seems to do somewhat better than regression, in reducing
noise (variation among replicate measures on the same sample), at the cost of (sometimes)
introducing a few negative values. 

Invariant Set Normalization 
Li and Wong introduced a method, where a large number of genes are
selected adhoc as references, rather than using a standard set of 'housekeeping genes'.
Their method assumes that there is a subset of unchanged genes, between any two samples.
Their method selects a subset of genes g_{1},…, g_{M},
whose probes: p_{1}, …, p_{K}, (K ~ 10000), occur in the same rank
order on each chip such that p_{1} < p_{2} < …< p_{K}
in both chips (an invariant set); then fits a nonparametric curve (running median) through
the points { (p_{1}^{(1)}, p_{1}^{(2)}),…,
(p_{K}^{(1)}, p_{K}^{(2)}) }. Ideally one would like a common
invariant set of reference genes across all chips, but in practice, only a very
few probes are in common rank order, or even close to that, across all chips. 

Quantile Normalization 
Terry Speed's group introduced a nonparametric procedure normalizing to a
synthetic chip. Their method assumes that the distribution of gene abundances is nearly
the same in all samples. For convenience they take the pooled distribution of probes on all
chips. Then to normalize each chip they compute for each value, the quantile of that value in
the distribution of probe intensities; they then transform the original value to that quantile's
value on the reference chip. In a formula, the transform is 
x_{norm} = F_{i}^{1}(F_{ref}(x))
, 
where F_{i} is the distribution function of chip i,
and F_{ref} is the distribution function of the reference chip. 

Figure 3. Schematic representation of quantile normalization: the value x, which is
the αth quantile of all probes on chip 1, is mapped to the value y, which is the
α quantile of the reference distribution F_{2}. 
If F_{i} and F_{ref} are fairly similar in shape, then in
practice this transform is not too different from a straight line, which is what a scaling
transform looks like; see Figure 4. However the transform is strong enough to cope with the
nonlinear ratiointensity relationships revealed in figure 5A; see figure 5B
after quantile transformation. 

Figure 4. The effects of quantile normalization
on raw probe values in three chips. Raw values are on xaxis, normalized values
on yaxis. Often this transform looks very much like a scaling transform
(nearly linear), but sometimes it is quite nonlinear. 
Figure A. Ratio Intensity Plot of all probes for four pairs of chips from GeneLogic
spikein experiment 
Figure B. As in A, after normalization by matching quantiles. Both figures courtesy of Terry
Speed 
This form of normalisation also reduces noise among replicate measures of the
same samples, compared to normalization by scaling, as show below in figure 6. 

Figure 6. Each dot represents one probe on an Affymetrix U95A chip. On the yaxis is the
ratio of variance across a set of replicates after quantile normalisation, divided by the variance
of the scalenormalized values. On the xaxis are the mean levels. Both axes on
log scale. 
The main drawback of this approach to normalization is the strong assumption
that the distributions of probe intensities are identical (even if individual probes differ in
their positions in the distribution). This is true for low abundance genes, and to a fairly
good approximation for genes of moderate abundance, but certainly not true for the few highabundance
genes, whose typical levels vary noticeably from sample to sample. 

Combining ProbeLevel Signals into Gene Abundance Estimates 
There has been considerable discussion over the appropriate algorithm for
constructing single expression estimates based on multipleprobe hybridization data. To date,
over a dozen different methods have been published, which aim to synthesize the different readings
from the various probes for a gene, into a single estimate of transcript abundance. Affymetrix
recently sponsored a conference on the topic <LINK>. 

Estimation by Affymetrix MicroArray Suite 
Affymetrix has upgraded their MicroArray Suite (MAS) software several times over
the short history of their product. MAS 4.0 was the standard until January 2002 and is still cited
in published papers. MAS 4 calculates a weighted average of the probepair differences (PM – MM)
for each probe pair representing a gene. MAS 5.0 improves in two important ways. First the
intensities are transformed to a logarithmic scale before the average is taken; this equalizes the
contribution of different probes. Secondly an estimate of background based on MM replaces MM itself
in the difference PMMM; this estimate is itself a weighted average of log probe pair differences:
log(PM_{j}/MM_{j}). The "Statistical Algorithm Description
Document" from Affymetrix, has more details. 
The idea of averaging different probe intensities for the same gene is seems
quite wrong. The individual probes in a probe set have very different hybridization kinetics.
Taking an average of their intensities, is like averaging the readings from scans taken at very
different settings. A good algorithm should compare information about probe characteristics, based
on the performance of each probe across chips, and use this to make a better estimate. These
principles are the basis of the multichip models. Affymetrix has seen the evidence, and they are
planning to release their own multichip model in 2004. MAS 4.0 and MAS 5.0 will soon be of
historic interest only. 
MultiChip Models 
A chemical motivation for multichip models comes from reasoning that the amount
of signal from one probe in a gene's probe set, should depend both on the amount of that gene in
the sample, and on the specific affinity of the probe for that gene's mRNA. The statistical
motivation for multichip models is that the signals from individual probes move in parallel across
a set of chips; the signals have roughly the same pattern across the different samples, as shown
in Figure 3. The animations of probe sets in dChip show
this quite compellingly. 

Figure 3. Probe signals from a spikein experiment. The concentrations are plotted along
the horizontal axis (log scale), and the probe signals are plotted on the
vertical axis (log scale). Each probe is represented by one color. The
different probe signals change in parallel. Image courtesy of Terry Speed 

A good estimation strategy estimates both the characteristics of the probe
(affinity for the target or avidity), and gene abundance. The simplest strategy is
the statistical linear twofactor model: that model assumes that the errors in each data point
have similar variances, and the two factors combine in a simple way. If the signal from each probe
is proportional to both probe affinity, and gene abundance, then it must be proportional to the
product of these factors multiplied together. Suppose for one target gene, the chip has a set of
probes p_{1},…,p_{k}; each probe p_{j} binds to the
target with affinity f_{k}. Suppose in each sample i the gene occurs in
amount a_{i}. Then the intensity of probe j on chip i should be
proportional to f_{k} a_{i}. See figure 4. 

Figure 4. Ideal linear model relationship among intensity (height of green bars),
abundance of transcript (a_{i}), and probe affinity (_{j}). 
In practice, the discrepancies between real data and the ideal model are
frequent outliers, far beyond the usual random fluctuations in signal intensities. These
may be due to scratches, or uneven heating, or other artefacts. Typically up to15% of
probes in an acceptable Affymetrix chip are outliers. Most standard methods to
fit data flounder badly on data with this many outliers. The robust approach is
to try to identify the outliers, and exclude or downweight them. 

Constant Variance – the Li and Wong Model and Critique 
Li and Wong originally suggested the model PM_{ij} – MM
_{ij} = f_{k} a_{i}_{ }+ ε_{ij}
, by analogy with MAS 4.0. They have found better fits with the model PM_{ij}
= f_{k} a_{i}_{ } + ε_{ij}, (PMonly).
LiWong assumes that the noise in all the probe measures is roughly same size.
In practice all biological measures exhibit intensitydependent noise. The effect of their
assumption is that probes with smaller variation are ignored, even though this variation may be
measuring real differences. Fortunately the bright probes are often the most specific, and it
does little harm to ignore the majority of probes, if the bright probes are good. They have tuned
their fitting procedure to try to reduce the emphasis on the very bright probes, but this has resulted
in often throwing out a good bright probe probe as an outlier. 

Figure 5. A probe set with values (represented by red lines) fitted to actual PMMM values
(in blue); from dChip. 
Proportional Variance – RMA 
This is largely the work of Terry Speed's group at Berkeley, especially Ben
Bolstad, and Rafael Irizarry. They work only with PM values, and ignore MM entirely.
They take a log transform of equation () and find 



With errors proportional to intensity in the original scale, the errors on the
log scale have constant variance. After background subtraction and normalization they fit: 



where nlog is their terminology for 'normalize and then take logarithm'. They
fit this model by iteratively reweighted least squares, or by median polish. Code is available in
the affy package on BioConductor, together with quantile normalization. 
At the moment RMA appears to be the best method available. Figure 6 compares
the performance of several algorithms on replicate arrays; the smaller the variability the better
the algorithm. Four strata of genes are shown: from lowest to highest expression. MAS5 apparently
does a decent job on high abundance genes, but the multichip models do better on lowabundance
genes, which include many genes of interest, such as transcription factors, and
signalling proteins. 

Figure 6. Comparison of MAS5 (green), dChip (black), RMA (blue), RMA (red): The genes have
been divided into quarters based on average expression. Each boxplot represents
the standard deviation of genes in one fraction. Note that the multichip
models do almost ten times better than MAS5 on the lowabundance genes; this
category includes most transcription factors and signalling proteins. 

Modelbased Quality Control of Affymetrix Chips 
One of the particular values of a multiprobe system is that all probes
effectively act like positive controls. Since the Affymetrix probes have such different response
characteristics, you don?t want to reject large or small probes, but with a good multichip model,
the hybridization problems show up as outliers from fitted multiprobe model. The outliers are
surprisingly good at showing problems in distinct areas of some chips. For example, the following
figure 7 was made from dChip, and shows outliers on one chip. These outliers mostly
occur in two rings. 

Figure 7. Outliers are show in pink 
Software Available 
Li and Wong's method is available through their program dChip, at www.dchip.org. Academic licenses are free. 
The RMA method is available as part of the affy package in the Bioconductor
tools suite: see www.bioconductor.org.
A commercial software vendor, Iobion, has incorporated RMA into their GeneTrafic
product. 