Genomics and Bioinformatics Group

Microarray Data Analysis

Two-Color Normalization

Preprocessing of Two Color Competitively Hybridized Arrays – From Image to Estimates

This section covers the low–level preprocessing steps from the point at which the chip is scanned to the point of obtaining reliable estimates of the relative abundance of each gene in all of the samples. Broadly these steps may be classified as image analysis, quality control, background correction, and normalization, although all of these procedures are inter–dependent, and not always done in this order.

Image Quantification

Although the lab technician usually handles this step, using the default settings on an image quantification program, the program and the settings can have a noticeable impact on the noise level of the subsequent estimates. There are several steps in image quantification:
  • laying the grid – finding where the printed spots ought to be in the image
  • identifying the extent of each spot, and separating foreground from background
  • summarizing the varying brightnesses of the pixels in the foreground of each spot
  • dealing with scanner saturation
  • dealing with variable backgrounds
The technician does the first step interactively. The quantification program must deal with several problems in the second step. It is rare that the probes are uniform in size and shape; most programs try to adapt to the sizes, and sometimes shapes, of individual spots. Sometimes poorly stabilized neighboring probes bleed into the surrounding region, which may invert the relation between foreground and background (see figure).
Figure 1. Some problems that quantification must deal with
The different programs each try different approaches to these problems, and they make a difference in the reproducibility of gene expression measures. A study done in California showed considerable differences among results from different quantification programs applied to 8 arrays that compared the same two samples. Several different settings were used on most programs. Since all arrays compared the same two samples, the ratios should be identical; the standard deviation of the reported ratios is therefore an (inverse) measure of quality.
Figure 2. Noise levels from the same 8 cDNA slides, as quantified in many different ways (courtesy Jean Yang, UCSF).

Background Correction

The previous figure shows that background subtraction added to the noise in those measures, and many researchers have had similar experience. On the other hand, it seems in principle wrong to ignore background, and some substrates (e.g. poly-L-lysine) show substantial additive background; ignoring this leads to substantial bias in the estimated ratios of most genes. At the moment there is no consensus, but two suggestions may help for the time being.
It may be pragmatic to ignore background correction if the goal is to detect a few differentially expressed genes among the multitude of noisy, similarly expressed genes. For this purpose noise control is most important. On the other hand, when it comes time to estimate the fold-change, subtracting background, or applying some other form of background correction, gives more accurate estimates.
Perhaps the resolution of this paradox will come with more sophisticated forms of background correction. The Spot program and the Agilent software both do something different from local background estimation, and both give less noisy estimates than subtraction. In principle the raw intensity of a spot is made up of fluorescence from labeled transcripts, plus reflection or emission from the substrate. The label fluorescence comes from the target transcript and also a mixture of other transcripts that have bound non–specifically to the spot. The local background is made up of the reflection from the substrate, and stray bits of labeled transcript that have bound to the surface; sometimes, if a nearby probe has spread, it includes labeled transcript from that other gene. The make-up of the local background differs from the background we want to correct on the spot. On the other hand the negative controls show just non-specific hybridization, which is the bulk of what we want to correct on each gene probe. Perhaps a better way would be to subtract a weighted average of the local background and the values of negative controls nearby.
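A minimal sketch of this weighted-average correction; the function name and the weight w = 0.5 are illustrative assumptions, not an established method:

```python
def corrected_intensity(raw, local_bg, neg_controls_nearby, w=0.5):
    """Subtract a weighted average of the local background and the
    median of nearby negative-control signals from a spot's raw
    intensity. The weight w = 0.5 is an illustrative choice only."""
    neg = sorted(neg_controls_nearby)
    n = len(neg)
    # median of the nearby negative controls
    med_neg = neg[n // 2] if n % 2 else (neg[n // 2 - 1] + neg[n // 2]) / 2
    return raw - (w * local_bg + (1 - w) * med_neg)
```

In practice the weight would have to be tuned per platform, since the relative contributions of substrate reflection and non-specific hybridization differ between substrates.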

Quality Control

A lot of the messy business of statistics is cleaning up data. Although this is less exciting than higher–level analysis, it makes as much of a difference to the results as normalization and other processes.
Wet Lab Quality Checks
The best place to check quality is in the wet lab, before the measures are taken. Two standard checks are RNA quality and dye incorporation.
Between the time that a sample is taken, and the time the RNA is extracted and purified, enzymes in the cell rapidly degrade mRNA by cutting it into shorter pieces. Most of these short pieces will hybridize more easily to several different probes, which distorts expression measures. One way to detect degraded RNA is to examine two abundant types of RNA – the 18S and 28S ribosomal RNAs. If the ribosomal RNAs are mostly intact they form two sharp peaks as the total RNA is washed through a gel. This may also be done with a commercial tool such as the Agilent BioAnalyzer.
Since the measures depend directly on how much of the labelling dye is present on a probe, it makes sense to check how well the label is incorporated in the sample. In practice the amount of label in different samples varies, especially for the red Cy5 dye. Microarray technicians have often observed that the Cy5 label is taken up poorly in hot humid summers. Researchers from Agilent have recently confirmed that even moderate levels (5 ppb) of ozone can degrade Cy5, while not affecting Cy3. A commercial product to measure how much label is incorporated in the sample is the NanoDrop Probe.
Since it is much more trouble to detect and correct problems after the hybridization, it is worth the effort to check for hybridization problems in the lab. You may discard chips with very problematic hybridizations. You should do this before testing your favorite hypothesis; in the real world, you often do it interactively as you find faults with chips that do not fit your ideas so well.
If the sample RNA and the labelling pass the wetlab quality checks, then the best information about the process of hybridisation comes from the controls. There is no excuse for chips without a well–designed set of negative and positive control probes. Negative controls are probes designed for DNA sequences that should never occur in your sample. Positive controls are replicate probes for sequences that should be abundant. Both positive and negative controls should be distributed over the chip. Spike-in controls are probes for transcripts not expected in your sample but added (in known amounts) to the samples before labelling.
The negative controls should all report low signal, and this low value should be fairly uniform (i.e. it should not show any pronounced spatial pattern, although control probes from different genes will typically have different means). The signal from negative controls gives an idea of the background present in all signals due to non-specific hybridization. You won’t be able to reliably estimate those gene abundances whose signals are comparable to the signals from negative controls, even if they are above the local background.
Positive controls give some idea of the spatial variation in hybridization. Probes for the same gene should show fairly uniform intensities across the chip. If the positive controls are very different from their average in some region, it is worth taking a closer look, and perhaps discarding all signals from that region. It is common to see spatial gradients in intensity, and sometimes in ratio. Often during hybridization a two-color chip is placed on a surface that is not precisely level. More of the sample is then present at one end of the chip, or along one side, and that end of the chip is brighter than the other; this is not a serious problem unless the log ratios show a similar gradient.
In data from poorly functioning hybridization stations one often observes uneven signal and high background around the inlet ports; it seems the turbulent fluid affects the hybridization reaction. One should discard signals from the affected regions, and if this uneven pattern extends for a long way it is better to discard the chip.
Spike-in controls (such as Amersham’s) give some idea of the accuracy and linearity of the intensities. Some genes are added in ratios of 3:1 or 10:1 to the two samples. Typically one sees that the reported ratios are compressed, and sometimes that the low intensity genes show different ratios than the majority. These can be very useful additional information during normalization, but they should not be taken too literally, as they rarely work exactly as expected.

Quality Control of Individual Probes

If the samples and the hybridisation pass the previous tests, the next step is QC for individual spots or probes. Spot-level QC detects mostly printing problems rather than hybridization anomalies. Most image quantification programs flag spots that fail their internal QC measures; it is rarely a good idea to keep spots that have been flagged. You may want to do further QC of individual spots based on several other measures reported by the image processing program (GenePix and Quantarray give many). Commonly reported measures include spot area, uniformity (standard deviation of the foreground), and background uniformity. It is not practical to examine thousands of spots individually; an automated filtering procedure is what is needed. However the filtering criteria that are useful for one experiment are too slack or too strict for the next; there are no rules about spot size or background that apply across the board to all chips under all circumstances.
A sensible thing to do before filtering is to examine the distribution of the various measures across the chips for each new experiment. Then identify the ‘normal’ range for each of these variables, and what values are unacceptable. Then discard (or down-weight) all those spots that fall outside the ‘normal’ range. This is best done in collaboration with a good core facility.
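The distribution-based filtering described above might be sketched as follows; the 1st/99th percentile cut-offs and the function names are illustrative assumptions, since the appropriate range differs from experiment to experiment:

```python
from statistics import quantiles

def normal_range(values, low_pct=1, high_pct=99):
    """Empirical 'normal' range for one QC measure on one chip:
    here the 1st and 99th percentiles, an illustrative choice."""
    qs = quantiles(values, n=100, method="inclusive")
    return qs[low_pct - 1], qs[high_pct - 1]

def flag_outliers(spots, measures):
    """Flag any spot falling outside the normal range of any measure.
    `spots` maps spot id -> dict of measure name -> value."""
    ranges = {m: normal_range([s[m] for s in spots.values()])
              for m in measures}
    flagged = set()
    for sid, s in spots.items():
        for m in measures:
            lo, hi = ranges[m]
            if not (lo <= s[m] <= hi):
                flagged.add(sid)
    return flagged
```

In a real pipeline one would inspect the histograms before fixing the percentile cut-offs, as the text recommends, rather than applying the same thresholds to every chip.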
The printer often drops small amounts of probe elsewhere than intended. This becomes a problem if a spatter of probe for a highly expressed gene lands on a probe for a faint gene; then the signal from both channels reflects the abundant gene, rather than the gene that is annotated at that position. Another type of problem is spot formation – printers aim to deliver fairly round, even-sized spots. When they fail, printed clones may flow into each other. So in practice it is troublesome to use data from extremely small or extremely large spots, or from those that are very irregular. Further measures you might use in batch filtering depend on the level of noise in the image, and the uniformity of the color ratios.
Figure 1. Section of cDNA image: some spots run into each other; these spots have excessively large areas.
The area criterion is the easiest to apply and understand. Spots whose size is only a few tens of pixels are much more likely to be scatterings of bright probe; extremely large spots are likely to be mingled with their neighbors.
Figure 2. A plot of one quality score as a function of diameter, for a grid where the intended diameter is 100 microns, and the inter–spot distance is 200 microns.
The uniformity criteria are perhaps the most complex, because there are so many options, and no underlying explanations for the variation. High foreground variability is obviously a problem – it makes it very hard to be confident about the real ratio. However it is not clear from first principles what an acceptable variation is, and different chips have very different distributions of the uniformity measures. Some programs give the red–green correlation, which is a direct measure of the replicability of the ratio measures; values less than 0.8 should be down-weighted or discarded. Usually one has to decide based on indirect measures of uniformity. Most programs give both a mean and a median for a spot. If the spot has a reasonable distribution of pixels, the mean and median should be similar. If they are quite different, something strange is happening, such as a droplet. We accept spots if the mean and median differ by at most 15%: |μ − m| < 0.15(μ + m)/2. Many image quantification programs now give standard deviations for the foreground and background in both channels. A reasonable criterion is to accept a spot if the foreground is well above the background noise: μfg > μbg + 2σbg. Unfortunately this often fails for a majority of spots on some chips. At this point it is not clear whether these chips are really poor, or whether the criterion is too strict.
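The two acceptance criteria above translate directly into code; a minimal sketch, with hypothetical function names:

```python
def mean_median_ok(mean_fg, median_fg, tol=0.15):
    """Accept a spot if mean and median foreground differ by at most
    15% of their average: |mu - m| < 0.15 * (mu + m) / 2."""
    return abs(mean_fg - median_fg) < tol * (mean_fg + median_fg) / 2

def above_background(mean_fg, mean_bg, sd_bg):
    """Accept a spot if foreground is well above background noise:
    mu_fg > mu_bg + 2 * sigma_bg."""
    return mean_fg > mean_bg + 2 * sd_bg
```

As noted in the text, the second criterion can fail for a majority of spots on some chips, so both tests are candidates for down-weighting rather than hard filtering.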
Some chips feature duplicate probes for each gene. We use a 15% criterion there also: accept if |μ1 − μ2| < 0.15(μ1 + μ2)/2. There may be some point in doing a spatial normalization here before applying this quality control criterion.
It is simplest to set up criteria as filters, and to exclude spots that fail any quality criterion at a certain threshold. However in practice few spots may pass all criteria, even with reasonable thresholds for each. Some groups use a composite score. Wang et al. (2002) construct quality measures q1, q2, q3, and q4, based on area, signal-to-noise, background level, and variability; they define a composite score q* = (q1q2q3q4)^(1/4), and reject a spot if the composite q* < 0.8. The threshold of 0.8 is somewhat arbitrary, although spots in their arrays with q* ~ 0.5 have twice the random variation of those with q* > 0.8.
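The composite score is a geometric mean of the individual quality measures; a minimal sketch (the function name is hypothetical, the formula is from Wang et al. 2002):

```python
def composite_quality(q1, q2, q3, q4, threshold=0.8):
    """Composite score q* = (q1*q2*q3*q4)**(1/4), the geometric mean
    of four quality measures; the spot is accepted if q* >= threshold."""
    q_star = (q1 * q2 * q3 * q4) ** 0.25
    return q_star, q_star >= threshold
```

The geometric mean has the convenient property that a single very poor measure drags q* down sharply, while a spot must be reasonable on all four measures to pass.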
In principle, most quality measures are continuous, and while there are obvious outliers, there is no clear–cut threshold. A better procedure than filtering would be to down-weight probe signals in further analysis, based on quality score. This poses a practical problem for most people, since it is difficult to use weight information in packaged software, although it is easy to adapt hand-coded R routines to weighted signals.
Reference: Effects of Atmospheric Ozone on Microarray Data Quality
Thomas L. Fare, Ernest M. Coffey, Hongyue Dai, Yudong D. He, Deborah A. Kessler, Kristopher A. Kilian, John E. Koch, Eric LeProust, Matthew J. Marton, Michael R. Meyer, Roland B. Stoughton, George Y. Tokiwa, and Yanqun Wang
Anal. Chem., ASAP Article 10.1021

Normalization of Competitively Hybridized (Two-Color) Microarrays

Why Normalize?
Biologists have long experience coping with systematic variation between experimental conditions that is unrelated to the biological differences they seek. However expression arrays have even more ways to vary systematically than measures such as rt-PCR. In practice methods that have worked well for these types of measures do not perform as well for microarray data, which shows many dimensions of systematic differences.
Normalization is the attempt to compensate for systematic technical differences between chips, to see more clearly the systematic biological differences between samples. Differences in treatment of two samples, especially in labelling and in hybridization, bias the relative measures on any two chips.
Systematic non-biological differences between chips are evident in several ways:
  • Total brightness differs between chips
  • One dye seems stronger than the other (in 2-color systems) on one chip, but not on another
  • Typical background is higher in one chip than on another.
There are also many non–obvious systematic differences between chips in an experiment, and even between the two channels on a single array.
Some causes of systematic measurement variation include:
  • Different amounts of RNA
  • One dye is more readily incorporated than the other (in 2-color systems)
  • The hybridisation reaction may proceed more fully to equilibrium in one array than the other
  • Hybridisation conditions may vary across an array
  • Scanner settings are often different, and of course
  • Murphy’s Law predicts even more variation than can be simply explained.
In order to identify the real biological differences, we attempt to compensate for the systematic technical differences in measurement. Although the aims of normalization for all arrays are similar, the issues and techniques used in normalization of two-color arrays differ from those useful for normalization for Affymetrix arrays.
Housekeeping Genes
One early approach was to find a standard gene invariant across all chips or samples. This is the commonsense approach, used routinely in rt-PCR. The standard genes tried were ‘housekeeping’ genes – genes required in all cell types – on the theory that they occur at nearly equal levels in all cells. However, these housekeeping genes seem to vary by 30% or more across interesting samples. This is quite sufficient accuracy for rt-PCR, where one cycle corresponds to an increase of a factor of two. However in a microarray study a 30% difference over the whole genome is enormous.
Quantitative Approaches
Most approaches to normalizing expression levels assume that the overall distribution of RNA numbers doesn't change much between samples, and that most individual genes change very little across the conditions. This seems reasonable for most laboratory treatments, although treatments affecting transcription apparatus have large systemic effects, and malignant tumours often have dramatically different expression profiles. If most genes are unchanged, then the mean transcript levels should be the same for each condition. An even stronger version of this idea is that the distributions of gene abundances must be similar.
Statisticians call systematic errors that affect a large number of genes ‘bias’. Keep in mind that normalization, like any form of data ‘fiddling’, adds noise (random error) to the expression measures. You never really identify the true source or nature of a systematic bias; rather you identify some feature that correlates with the systematic error. When you ‘correct’ for that feature, you are adding some error to those samples where the feature you have observed does not correspond well with the true underlying source of bias. Statisticians try to balance bias and noise, and their rule of thumb is that it is better to under-correct for systematic biases than to compensate fully.
Scaling by Brightness
The simplest approach posits that the total abundance of all genes is equal in the two samples on any one chip. Scaling a chip means multiplying the signals (intensity measures) for all genes by a common scale factor. This makes sense if equal weights of RNA from the two samples are hybridised on the array. The sizes of the RNA molecules are comparable, so the number of RNA molecules should also be roughly the same in each sample. Consequently, approximately the same number of labeled molecules from each sample should hybridise to the arrays and, therefore, the total of the measured intensities summed over all elements in the arrays should be the same. For a single chip compute scale factors Cred and Cgreen by:

Cred = Σi fi^red and Cgreen = Σi fi^green (summing over i = 1, …, N),

where fi^red represents the measured intensity of array element i in the red channel, and N is the total number of elements represented in the microarray. Individual intensities are then scaled by their sum:

fi^red → fi^red / Cred and fi^green → fi^green / Cgreen.

After this operation, the total intensity in each channel is equal to one, while individual intensities are inconveniently small. Often researchers choose one channel (e.g. green) to be the standard, and multiply the other by a scaling constant (e.g. Cgreen/Cred). The result is that the mean of gene expression values is the same in both channels, and that the mean difference (mean of all subtracted intensities) is 0.
Sometimes this operation is done on a logarithmic scale, which has a somewhat different result: a mean log–ratio equal to zero; this means the (geometric) mean of the individual gene ratios is equal to 1. Done on a logarithmic scale, this operation is equivalent to subtracting the average log–ratio from all the individual log–ratios.
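On the log scale this centering is a one-line operation; a minimal sketch (the function name is hypothetical):

```python
from math import log2

def center_log_ratios(red, green):
    """Global normalization on the log scale: subtract the average
    log2-ratio from every spot's log2-ratio, so the mean normalized
    log-ratio is zero (geometric mean ratio equal to 1)."""
    m = [log2(r / g) for r, g in zip(red, green)]
    mean_m = sum(m) / len(m)
    return [mi - mean_m for mi in m]
```

After centering, a gene with normalized log-ratio 0 is at the chip-wide average ratio, not necessarily unchanged in absolute terms.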
In order to make individual channels more comparable across chips, the same constant is used for all chips. In practice there are often anomalies at the top end, for example a number of probes are saturated on one chip, but not on the other. More consistent results are obtained by using a robust estimator, such as the median or the one-third trimmed mean. To do the latter: compute the mean of the middle two-thirds of all probes in the red and the green channels, and scale all probes to make those means equal. John Quackenbush suggested this originally, but TIGR now uses lowess – see below.
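The trimmed-mean scaling might be sketched as follows; the choice of the red channel as the standard and the function names are illustrative assumptions:

```python
def trimmed_mean(values, trim=1/6):
    """Mean of the middle two-thirds of the sorted values: drop the
    lowest and highest sixth (trim = 1/6 on each side)."""
    v = sorted(values)
    k = int(len(v) * trim)
    middle = v[k: len(v) - k]
    return sum(middle) / len(middle)

def scale_green_to_red(red, green):
    """Rescale the green channel so its trimmed mean matches the red
    channel's; red is chosen as the standard here for illustration."""
    c = trimmed_mean(red) / trimmed_mean(green)
    return [g * c for g in green]
```

Trimming protects the scale factor from saturated probes at the top end and from near-background probes at the bottom, which is exactly the anomaly described above.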
Two Parameter Normalization Methods
Whereas scale normalization adjusts the mean of the log-ratios within one chip, it is common to find also that the variance of the log-ratios differs between arrays. One approach to dealing with this problem is to scale the log2(ratio) measures (after scale normalization within chips) so that the spread (measured by the variance) of the log-ratios is the same for all chips. This is an example of over-correcting a bias. The procedure usually succeeds in reducing the overall variance of log-ratios between chips, but sometimes the variability of many individual genes is actually increased. This approach is not widely used.
Intensity Dependent Normalization with Lowess
The scale normalization adjusts for overall dye bias. Terry Speed’s lab identified an intensity-dependent dye bias, and introduced a popular method for adjusting it. One commonly observes that the log2(ratio) values have a systematic dependence on intensity – most commonly a deviation from zero for low-intensity spots. Under-expressed genes appear up-regulated in the red channel; moderately expressed genes appear up-regulated in the green channel. No known biological process would regulate genes that way – so this must be an artefact. It appears that the explanation is chemical: the two dyes do not give off equal light per molecule at different concentrations. This is due to ‘quenching’: a phenomenon where dye molecules in close proximity re-absorb light from each other, thus diminishing the signal. The amount of re-absorption changes with concentration differently for the two dyes.
The easiest way to visualize intensity–dependent effects is to plot the measured log2(Ri/Gi) for each element on the array as a function of the log2(RiGi) product intensities. This 'R–I' (for ratio–intensity) plot can reveal intensity–specific artifacts in the log2(ratio) measurements. Note that Terry Speed’s group calls these variables ‘M’ and ‘A’, (for ‘minus’ and ‘add’ – on the log scale) and they call the plot an ‘MA plot’.
Figure 1. Ratio–Intensity plot showing characteristic 'banana' shape of cDNA ratios; log scale on both axes. (courtesy Terry Speed)
We would like a normalization method that can remove such intensity-dependent effects in the log2(ratio) values. The functional form of this dependence is unknown, and must depend on many variables we don't measure. An ad–hoc statistical approach widely used in such situations, is to fit some smooth curve through the points. One example of such a smooth curve is a locally weighted linear regression (lowess) curve. Terry Speed's group at Berkeley used this approach.
To calculate a lowess curve fit to a group of points (x1,y1), …, (xN,yN), we calculate at each point xi the locally weighted regression of y on x, using a weight function that down-weights data points that are more than 30% of the range away from xi. We can think of the calculated value as a kind of local mean. For each observation i on a two-color chip, set xi = log2(RiGi) and yi = log2(Ri/Gi). The lowess approach first estimates y*(x), the value of the regression line through points having similar intensities, then subtracts this from the experimentally observed ratio for each data point.
The normalized ratios ri* are given by log2(ri*) = log2(Ri/Gi) – y*(log2(RiGi)).
The result is that ratios at all intensities have a mean of 0, as seen below.
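A bare-bones sketch of this procedure, assuming a tricube-weighted local linear fit with a 30% span and no robustness iterations (production implementations, such as R's lowess, are considerably more elaborate and efficient):

```python
from math import log2

def lowess_fit(x, y, span=0.3):
    """Tiny lowess sketch: at each x_i, fit a weighted linear
    regression using tricube weights over points within `span` of
    the x range. Illustration only; no robustifying iterations."""
    n = len(x)
    h = span * (max(x) - min(x))
    fitted = []
    for i in range(n):
        w = []
        for j in range(n):
            d = abs(x[j] - x[i]) / h if h > 0 else 0.0
            w.append((1 - d ** 3) ** 3 if d < 1 else 0.0)
        sw = sum(w)
        xm = sum(wj * xj for wj, xj in zip(w, x)) / sw
        ym = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (xj - xm) ** 2 for wj, xj in zip(w, x))
        sxy = sum(wj * (xj - xm) * (yj - ym) for wj, xj, yj in zip(w, x, y))
        b = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ym + b * (x[i] - xm))
    return fitted

def normalize_ma(red, green, span=0.3):
    """Intensity-dependent normalization: subtract the lowess fit of
    M = log2(R/G) on A = log2(R*G) from each log-ratio."""
    a = [log2(r * g) for r, g in zip(red, green)]
    m = [log2(r / g) for r, g in zip(red, green)]
    fit = lowess_fit(a, m, span)
    return [mi - fi for mi, fi in zip(m, fit)]
```

If the dye bias were constant across intensities, this reduces to the global log-ratio centering described earlier; the local fit matters exactly when the MA plot shows the banana shape of Figure 1.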
Figure 2. As in Figure 1, but corrected by lowess normalization.
Global versus Local Normalization
Most normalization algorithms, including lowess, can be applied either globally (to the entire data set) or locally (to some physical subset of the data). For spotted arrays, local normalization is often applied to each group of array elements deposited by a single spotting pen (sometimes referred to as a 'pen group' or 'sub-grid'). Local normalization has the advantage that it can help correct for systematic spatial variation in the array, including inconsistencies among the spotting pens used to make the array, variability in the slide surface, and local differences in hybridization conditions across the array. However such a procedure may overfit the data, reducing accuracy, especially if the genes are not randomly spotted on the array; the approach assumes that genes in any sub-grid should have average expression ratios of 1, and that several hundred probes are in each group. Another approach is to look for a smooth correction to uneven hybridisation. The thinking behind this approach is that most spatial variation is caused by uneven fluid flow. Flow is continuous, and hence the correction should be continuous as well.
There is still not a consensus about the best way to do local normalization.
Quantile Normalization
A good design will place all contrasts of interest directly on chips, but sometimes that is impossible, or afterwards the experimenter wants a contrast that wasn't planned before. This requires comparing 'parallel' measures in a single channel between arrays. So many sources of systematic variation make such comparisons very difficult: variance is very high between parallel measures. We need a kind of normalisation that works across arrays as well as within arrays. It turns out that quantile normalization works quite well at reducing variance between arrays, while compensating for the intensity-dependent dye bias as well as lowess normalization does.
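Quantile normalization forces all arrays to share the same distribution of intensities; a minimal sketch, assuming equal numbers of probes per array and ignoring tie handling:

```python
def quantile_normalize(arrays):
    """Replace the k-th smallest value on each array with the mean of
    the k-th smallest values across all arrays, so every array ends
    up with an identical distribution of intensities."""
    n = len(arrays[0])
    order = [sorted(range(n), key=lambda i: a[i]) for a in arrays]
    sorted_vals = [sorted(a) for a in arrays]
    # mean of each rank position across arrays
    rank_means = [sum(sv[k] for sv in sorted_vals) / len(arrays)
                  for k in range(n)]
    out = []
    for ord_i in order:
        normed = [0.0] * n
        for k, i in enumerate(ord_i):
            normed[i] = rank_means[k]
        out.append(normed)
    return out
```

Note that only the ranks of the probes within each array survive the transformation; the common distribution is the average of the individual ones.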
