Genomics and Bioinformatics Group Genomics and Bioinformatics Group Genomics and Bioinformatics Group
Genomics and Bioinformatics Group

Microarray Data Analysis

Genomics and Bioinformatics Group
  Molec Maps

Low-Level Analysis of Affymetrix Chips

Image Quantification and Background

The image scanning algorithm lays down a grid over the image, to identify squares in which the probes should be, and then selects the brightest subsquare (usually 4x4, 4x5, or 5x5 pixels) as the probe. The scanning software reports the average pixel intensity of this subsquare, and its standard deviation, which is usually 5% – 20% of the mean.
MAS 5.0 computes local background in each of 16 squares, and then subtracts a weighted combination of these background estimates from each probe intensity

Description of Affymetrix Probe Sets

The strength of the Affymetrix system is that multiple distinct oligonucleotide probes on each chip represent every gene. However the signals from the different probes for the same gene aren't the same; signals from individual probes for the same gene may differ, on the same chip, by as much as two orders of magnitude (a factor of 100). See Figures 1 and 2. The sequences are different, and the probes have different hybridization constants for their target: the most important factor in signal intensity is C:G content. How do we combine signals from the many probes for a gene, into a single estimate of the abundance of that gene?
Human GAPDH probes from Affymetrix U95A chip image
Figure 1. Images of probes from human GAPDH probe set extracted from an Affymetrix U95A chip image. PM probes in top row; corresponding MM probes on bottom. Two probes are bright, three others are moderately bright, the rest are dim.
Figure 1B. Typical numbers from a GAPDH probe set.
Figure 2. Graphical representation of intensities from a GAPDH probe set. PM values appear in blue; MM in green. Vertical axis height represents 30,000.

Normalization of Affymetrix Chips

Why Normalize?
Biologists have long experience coping with systematic variation between experimental conditions (technical variation) that is unrelated to the biological differences they seek. Normalization is the attempt to compensate for systematic technical differences between chips, to see more clearly the systematic biological differences between samples. Differences in treatment of two samples, especially in labelling and in hybridization, bias the relative measures on any two chips.
The performance of expression arrays can vary in more ways than measures such as rt-PCR. Normalization methods that have worked well for these types of measures do not perform as well for microarray data.
Affymetrix introduced a new approach for their 133 series chips, using a set of 100 'housekeeping genes': the chips are re-scaled so the average values of these housekeeping genes are equal across all chips. This is much better than using a single housekeeping gene, and probably adequate for about 80% of chips in practice.
Most approaches to normalizing expression levels assume that the overall distribution of RNA numbers doesn't change much between samples, and that most individual genes change very little across the conditions. This seems reasonable for most laboratory treatments, although treatments affecting transcription apparatus have large systemic effects, and malignant tumours often have dramatically different expression profiles. If most genes are unchanged, then the mean transcript levels should be the same for each condition. An even stronger version of this idea is that the distributions of gene abundances must be similar.
Statisticians use the term 'bias' to describe systematic errors, which affect a large number of genes. eep in mind that normalization, like any form of data 'fiddling' adds noise (random error) to the expression measures. You never really identify the true source or nature of a systemic bias; rather you identify some feature, which correlates with the systematic error. When you 'correct' for that feature, you are adding some error to those samples where the feature you have observed doesn't correspond well with the true underlying source of bias. Statisticians try to balance bias and noise, and their rule of thumb is that it's better to under-correct for systemic biases than to compensate fully.
Normalization by Scaling and its Limitations
The simplest approach to normalizing Affymetrix data is to re-scale each chip in an experiment to equalize the average (or total) signal intensity across all chips. The reasoning behind this is that there should be with equal weights of RNA for all the samples; if the sizes of the RNA molecules are comparable, the number of RNA molecules should also be the roughly the same in each sample. Consequently, nearly the same number of labeled molecules from each sample should hybridise to the arrays and, if all other conditions were equal, the total hybridisation intensities summed over all elements in the arrays should be the same for each sample. Of course, other conditions aren't equal; hence normalization restores this equality. In practice, for a series of chips, define normalization constants C1 , C 2 ,…, by:
Nomalization constant is the sum of all flourescant intensities on a given chip, and so on,
where the numbers figene are the fluorescent intensities measured for each probe on chip i. Select a common total intensity K (eg. the average of the Ci's). Then to normalize all the chips to the common total intensity K, divide all fluorescent intensity readings from chip i by Ci., and multiply by K. Variants of this approach, scaling all probes on a chip by the chip's trimmed mean intensity, or by median intensity, are widely available in commercial software.
To do better, we examine in detail the relationships among replicate chips (chips hybridized to the same sample). Figure 1 shows a scatter plot of probes from one pair of chips; there is clearly a non-linear relation among probes. Figure 2 shows plots of probe distributions from a number of replicate chips on a log scale; these distributions have very different shapes; on a log scale, applying a scaling transform to a chip, shifts its distribution curve to the right or left, but doesn't change its shape. Figure 5A shows Ratio-Intensity plots of pairs of Affymetrix chips hybridized with replicate samples (so all log-ratios should be 0); a scaling transform will shift the R-I plots up or down, without changing their configuration. For perhaps 80% of chips, the relationship between probe distributions is close enough to linear that a scaling transform will bring the chips into line. For the other 20% of cases something different is needed.
Plot of two probe signals from two Affymetrix chips
Figure 1. Plot of probe signals from two Affymetrix chips hybridized with identical mRNA samples. The black straight line represents equality, while the blue curve is a spline fit through the scatter plot.
Density of 23 PM probe signals
Figure 2. Density of PM probe signals on 23 different chips from GeneLogic spike-in experiment (Courtesy of Terry Speed)
Normalization by MAS 5.0
MAS5 does slightly better than scaling using linear regression. The procedure is to construct a plot of each chip's probes against the corresponding probes on the baseline chip; eliminate the highest 1% of probes (and for symmetry the lowest 1%), and fit a regression line to the middle 98% of probes (ie. estimate slope and intercept: two parameters). Transform the values in each probe, by subtracting the intercept, and dividing by the slope, so that the regression line becomes the identity ( y = x ) line.
A better two-parameter approach is to both re-scale and shift the origin, in order fit both the mean and the standard deviation of the probe distribution to the common mean and standard deviation of all data. This seems to do somewhat better than regression, in reducing noise (variation among replicate measures on the same sample), at the cost of (sometimes) introducing a few negative values.
Invariant Set Normalization
Li and Wong introduced a method, where a large number of genes are selected ad-hoc as references, rather than using a standard set of 'housekeeping genes'. Their method assumes that there is a subset of unchanged genes, between any two samples. Their method selects a subset of genes g1,…, gM, whose probes: p1, …, pK, (K ~ 10000), occur in the same rank order on each chip such that p1 < p2 < …< pK in both chips (an invariant set); then fits a non-parametric curve (running median) through the points { (p1(1), p1(2)),…, (pK(1), pK(2)) }. Ideally one would like a common invariant set of reference genes across all chips, but in practice, only a very few probes are in common rank order, or even close to that, across all chips.
Quantile Normalization
Terry Speed's group introduced a non-parametric procedure normalizing to a synthetic chip. Their method assumes that the distribution of gene abundances is nearly the same in all samples. For convenience they take the pooled distribution of probes on all chips. Then to normalize each chip they compute for each value, the quantile of that value in the distribution of probe intensities; they then transform the original value to that quantile's value on the reference chip. In a formula, the transform is
xnorm = Fi-1(Fref(x)) ,
where Fi is the distribution function of chip i, and Fref is the distribution function of the reference chip.
Schemantic representation of quantile normalization
Figure 3. Schematic representation of quantile normalization: the value x, which is the α-th quantile of all probes on chip 1, is mapped to the value y, which is the α quantile of the reference distribution F2.
If Fi and Fref are fairly similar in shape, then in practice this transform is not too different from a straight line, which is what a scaling transform looks like; see Figure 4. However the transform is strong enough to cope with the non-linear ratio-intensity relationships revealed in figure 5A; see figure 5B after quantile transformation.
Effects of quantile normalization on raw probe values in three chips
Figure 4. The effects of quantile normalization on raw probe values in three chips. Raw values are on x-axis, normalized values on y-axis. Often this transform looks very much like a scaling transform (nearly linear), but sometimes it is quite non-linear.
Ratio Intensity Plot of all probes in chips from GeneLogic spike-in
Figure A. Ratio Intensity Plot of all probes for four pairs of chips from GeneLogic spike-in experiment
Normalization after matching quantiles
Figure B. As in A, after normalization by matching quantiles. Both figures courtesy of Terry Speed
This form of normalisation also reduces noise among replicate measures of the same samples, compared to normalization by scaling, as show below in figure 6.
Graph of the variation levels
Figure 6. Each dot represents one probe on an Affymetrix U95A chip. On the y-axis is the ratio of variance across a set of replicates after quantile normalisation, divided by the variance of the scale-normalized values. On the x-axis are the mean levels. Both axes on log scale.
The main drawback of this approach to normalization is the strong assumption that the distributions of probe intensities are identical (even if individual probes differ in their positions in the distribution). This is true for low abundance genes, and to a fairly good approximation for genes of moderate abundance, but certainly not true for the few high-abundance genes, whose typical levels vary noticeably from sample to sample.

Combining Probe-Level Signals into Gene Abundance Estimates

There has been considerable discussion over the appropriate algorithm for constructing single expression estimates based on multiple-probe hybridization data. To date, over a dozen different methods have been published, which aim to synthesize the different readings from the various probes for a gene, into a single estimate of transcript abundance. Affymetrix recently sponsored a conference on the topic <LINK>.
Estimation by Affymetrix MicroArray Suite
Affymetrix has upgraded their MicroArray Suite (MAS) software several times over the short history of their product. MAS 4.0 was the standard until January 2002 and is still cited in published papers. MAS 4 calculates a weighted average of the probe-pair differences (PM – MM) for each probe pair representing a gene. MAS 5.0 improves in two important ways. First the intensities are transformed to a logarithmic scale before the average is taken; this equalizes the contribution of different probes. Secondly an estimate of background based on MM replaces MM itself in the difference PM-MM; this estimate is itself a weighted average of log probe pair differences: log(PMj/MMj). The "Statistical Algorithm Description Document" from Affymetrix, has more details.
The idea of averaging different probe intensities for the same gene is seems quite wrong. The individual probes in a probe set have very different hybridization kinetics. Taking an average of their intensities, is like averaging the readings from scans taken at very different settings. A good algorithm should compare information about probe characteristics, based on the performance of each probe across chips, and use this to make a better estimate. These principles are the basis of the multi-chip models. Affymetrix has seen the evidence, and they are planning to release their own multi-chip model in 2004. MAS 4.0 and MAS 5.0 will soon be of historic interest only.
Multi-Chip Models
A chemical motivation for multi-chip models comes from reasoning that the amount of signal from one probe in a gene's probe set, should depend both on the amount of that gene in the sample, and on the specific affinity of the probe for that gene's mRNA. The statistical motivation for multi-chip models is that the signals from individual probes move in parallel across a set of chips; the signals have roughly the same pattern across the different samples, as shown in Figure 3. The animations of probe sets in dChip show this quite compellingly.
probe signals from a spike-in environment
Figure 3. Probe signals from a spike-in experiment. The concentrations are plotted along the horizontal axis (log scale), and the probe signals are plotted on the vertical axis (log scale). Each probe is represented by one color. The different probe signals change in parallel. Image courtesy of Terry Speed
A good estimation strategy estimates both the characteristics of the probe (affinity for the target or avidity), and gene abundance. The simplest strategy is the statistical linear two-factor model: that model assumes that the errors in each data point have similar variances, and the two factors combine in a simple way. If the signal from each probe is proportional to both probe affinity, and gene abundance, then it must be proportional to the product of these factors multiplied together. Suppose for one target gene, the chip has a set of probes p1,…,pk; each probe pj binds to the target with affinity fk. Suppose in each sample i the gene occurs in amount ai. Then the intensity of probe j on chip i should be proportional to fk ai. See figure 4.
Ideal linear model between intensity, abundance of transcript, and probe affinity
Figure 4. Ideal linear model relationship among intensity (height of green bars), abundance of transcript (ai), and probe affinity (j).
In practice, the discrepancies between real data and the ideal model are frequent outliers, far beyond the usual random fluctuations in signal intensities. These may be due to scratches, or uneven heating, or other artefacts. Typically up to15% of probes in an acceptable Affymetrix chip are outliers. Most standard methods to fit data flounder badly on data with this many outliers. The robust approach is to try to identify the outliers, and exclude or down-weight them.
Constant Variance – the Li and Wong Model and Critique
Li and Wong originally suggested the model PMij – MM ij = fk ai + εij , by analogy with MAS 4.0. They have found better fits with the model PMij = fk ai + εij, (PM-only). Li-Wong assumes that the noise in all the probe measures is roughly same size. In practice all biological measures exhibit intensity-dependent noise. The effect of their assumption is that probes with smaller variation are ignored, even though this variation may be measuring real differences. Fortunately the bright probes are often the most specific, and it does little harm to ignore the majority of probes, if the bright probes are good. They have tuned their fitting procedure to try to reduce the emphasis on the very bright probes, but this has resulted in often throwing out a good bright probe probe as an outlier.
Probe set fitted to PM-MM values
Figure 5. A probe set with values (represented by red lines) fitted to actual PM-MM values (in blue); from dChip.
Proportional Variance – RMA
This is largely the work of Terry Speed's group at Berkeley, especially Ben Bolstad, and Rafael Irizarry. They work only with PM values, and ignore MM entirely. They take a log transform of equation () and find
With errors proportional to intensity in the original scale, the errors on the log scale have constant variance. After background subtraction and normalization they fit:
where nlog is their terminology for 'normalize and then take logarithm'. They fit this model by iteratively re-weighted least squares, or by median polish. Code is available in the affy package on BioConductor, together with quantile normalization.
At the moment RMA appears to be the best method available. Figure 6 compares the performance of several algorithms on replicate arrays; the smaller the variability the better the algorithm. Four strata of genes are shown: from lowest to highest expression. MAS5 apparently does a decent job on high abundance genes, but the multi-chip models do better on low-abundance genes, which include many genes of interest, such as transcription factors, and signalling proteins.
Comparison of MAS5, dChip, RMA, RMA
Figure 6. Comparison of MAS5 (green), dChip (black), RMA (blue), RMA (red): The genes have been divided into quarters based on average expression. Each boxplot represents the standard deviation of genes in one fraction. Note that the multi-chip models do almost ten times better than MAS5 on the low-abundance genes; this category includes most transcription factors and signalling proteins.

Model-based Quality Control of Affymetrix Chips

One of the particular values of a multi-probe system is that all probes effectively act like positive controls. Since the Affymetrix probes have such different response characteristics, you don?t want to reject large or small probes, but with a good multi-chip model, the hybridization problems show up as outliers from fitted multi-probe model. The outliers are surprisingly good at showing problems in distinct areas of some chips. For example, the following figure 7 was made from dChip, and shows outliers on one chip. These outliers mostly occur in two rings.
Figure 7. Outliers are show in pink

Software Available

Li and Wong's method is available through their program dChip, at Academic licenses are free.
The RMA method is available as part of the affy package in the Bioconductor tools suite: see A commercial software vendor, Iobion, has incorporated RMA into their GeneTrafic product.

Genomics and Pharmacology Facility
                Home Page Link to Center for Cancer Research Home Page Link to National Cancer Institute Home Page Link to National Institutes of Health Link to Department of Health & Human Services Home Page