Genomics and Bioinformatics Group

Microarray Data Analysis

Design of Microarray Experiments

How many replicates is enough? Should you pool samples? What is a good design for a cDNA experiment?
The design of scientific experiments is an art of balancing considerations: skill, cost, equipment, and accuracy. For a given question, there won’t be one ‘right’ design: you may choose different designs for the same scientific question depending on resources and long-range plans. That said, some common-sense principles apply across the board (but are ignored more often than they should be).
If your goal is to make a series of experiments and to compare the results, ensure that the designs and hybridisation conditions are similar. Conditions such as RNA preservation medium, the protocols of hybridisation, and even regional ozone levels, can introduce systematic biases comparable in size to the biological differences you wish to detect. Taking a great deal of care to standardize conditions will pay off in much higher discovery rates.
To do a series of two-color hybridisations, you want to prepare enough common reference to serve for all experiments. Chip failures are common, and it is wise to prepare more labelled cDNA than you expect to use, if that is possible. Some of the more efficient designs will lose much information if a single hybridisation fails; you won’t want to use those designs if you can’t set aside samples to be re-hybridized quickly to chips from the same batch.


How many microarrays is enough? Statisticians do not like simple answers to this question; it depends on the goals of the study, the resources, and the reliability of the technology: specifically, how accurate the chips are, and how often a hybridization fails. However, the following guidelines apply to most situations. If an exploratory study aims to find large (more than two-fold) differences between two conditions, then a design with three samples per condition is usually adequate. If the aim is to find smaller differences, or almost all of the large differences, then five samples per group are necessary to obtain sufficiently reliable estimates of variation among samples within conditions, in order to distinguish true differences between conditions. This applies to both treatment and control conditions. Six samples per condition allow meaningful permutation tests, which can give more accurate, and less conservative, estimates of p-values and false discovery rates. If there are more than two conditions, and the treatments do not drastically alter the cell physiology, then the number of samples within any one condition can be somewhat less; with four or more conditions, one can obtain reasonable estimates of within-condition variation with only four samples per condition. All of these suggestions assume that there are no outlying samples (outliers should be discarded); it is wise to run one or two more samples per condition in clinical situations, where outliers occur commonly, and it is safer to run one more for animal experiments, where sometimes one animal in a condition appears very different from all the others.
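One way to see why six samples per condition matter for permutation tests is to count the distinct relabelings available. The sketch below assumes a simple two-group comparison with equal group sizes and a one-sided test; the function name is illustrative only.

```python
from math import comb

def smallest_permutation_p(n_per_group: int) -> float:
    """Smallest achievable one-sided p-value for a two-group permutation test."""
    # Number of distinct ways to split 2n samples into two groups of n.
    n_splits = comb(2 * n_per_group, n_per_group)
    # The observed labeling is one of these splits, so p cannot go below 1/n_splits.
    return 1.0 / n_splits

for n in (3, 4, 5, 6):
    print(n, comb(2 * n, n), smallest_permutation_p(n))
```

With three samples per group there are only 20 possible splits, so no gene can reach a permutation p-value below 0.05 before any multiple-testing correction; with six per group there are 924.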
The question of how many replicates to do depends on how small the differences are that you want to detect, and the noise level in your system. Different systems have different noise levels, and the only way to estimate the noise is to do three or four replicate hybridizations. For Affymetrix systems with the best analysis, we find 3 to 5 chips per group gives useful information. Usually many more cDNA chips are needed for comparable levels of accuracy. To estimate replicability of a two-color chip, hybridize three pairs of replicate dye-swaps (6 chips) using the same two (different) RNA samples.
To do meaningful clustering requires at least 20 samples, and generally many more. The key issue for clustering genes is how many different types of samples there are, because the different conditions expose the correlations in gene regulation. It is not useful to try to cluster genes from only two groups, as is sometimes done, and rarely useful to cluster genes from a study of fewer than five groups.


There is considerable disagreement about whether to pool individual samples, among practitioners and also among statisticians. Sometimes the amount of sample from any one individual is insufficient for hybridization, and in that case pooling is a practical necessity. In theory, if the variation of a gene among different individuals is approximately normally distributed, then pooling $n$ independent samples would reduce the variance according to the formula:
$$\operatorname{Var}(\text{pool}) = \frac{s^2}{n}$$
where $s^2$ is the variance of the expression estimates of any one gene across samples. In principle we could then further reduce the variation by making replicates of the pool, and hybridising to replicate arrays. Since technical variation is usually less than (roughly half of) individual variation, this strategy would in theory give us more accurate estimates of the group means for each gene.
Figure 1. Pooling
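The theoretical argument can be made concrete with a little arithmetic. The sketch below assumes independent, roughly normal errors, writes `s2` for the between-individual variance and `t2` for the technical (array-to-array) variance, and takes "roughly half" to mean `t2 = s2 / 2`; these symbols and that reading are assumptions for illustration, not part of the original text.

```python
def pooled_mean_variance(s2: float, t2: float, n_pooled: int, n_arrays: int) -> float:
    """Variance of a group mean from one pool of n_pooled individuals
    hybridised to n_arrays replicate arrays."""
    return s2 / n_pooled + t2 / n_arrays

def unpooled_mean_variance(s2: float, t2: float, n_individuals: int) -> float:
    """Variance of a group mean when each individual gets its own array."""
    return (s2 + t2) / n_individuals

s2, t2 = 1.0, 0.5
print(pooled_mean_variance(s2, t2, n_pooled=6, n_arrays=3))   # six animals, three arrays
print(unpooled_mean_variance(s2, t2, n_individuals=3))        # three animals, three arrays
print(unpooled_mean_variance(s2, t2, n_individuals=6))        # six animals, six arrays
```

Under these idealised assumptions, pooling six individuals onto three arrays beats three individual arrays for the same chip budget, which is the theoretical case for pooling. The calculation says nothing about outlying individuals.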
In practice the expression levels of many genes among individuals are not roughly normal; often there are more very high values (outliers) than a normal distribution would predict. Some individual samples have levels of stress response proteins and immunoglobulins five to ten fold higher than typical. This can be due to many factors unrelated to the experimental treatment: for example, individual animals or subjects may be infected, or some tissue samples may be anoxic for long periods before preservation, which allows cells to respond to stress (Pritchard et al., "Project Normal", PNAS (2002)). This is easier to detect if individual samples are hybridized separately. In some studies (Terry Speed; unpublished data), where the same samples were analysed by pooled and unpooled designs, the majority of genes that were identified as differently expressed between two groups in the pooled design turned out to be extreme in only one individual. Also, if one pools samples, there is no way to estimate variation between individuals, which is sometimes important and often interesting.

Designs for Two-Color Arrays

The most common design for two-color (competitively hybridised spotted) arrays is the ‘reference design’: each experimental sample is hybridised against a common reference sample. Although this effectively means that only one sample of interest is hybridised per chip, the reference design has several practical advantages over more efficient designs:
  • it extends easily to other experiments, if the common reference is preserved;
  • it is robust to multiple chip failures; and
  • it reduces the incidence of laboratory mistakes, because each sample is handled the same way.
We may represent designs by pictures where circles represent samples, and arrows represent chips; the red and green ends of the arrows represent the dyes used for the samples at either end.
Figure 2. A reference design: the red and green arrows represent chips.
The reference sample is used in many chips, so the reference mRNA needs to be abundant. When comparing treatment versus control samples the most natural reference is the wild type or the biological controls, which are often the most abundant. However, if the study aims to compare each of several samples against all others, there is no natural control. A reliable alternative is a common reference obtained by pooling all samples. This enables samples to be compared with each other indirectly. A pooled reference sample reduces the number of extreme gene ratios (which have large errors) on each chip. Some labs take this further and create a ‘universal reference’: a pool of mRNA derived from several standard cell lines, which they use in most of their experiments. Using a universal reference enables them to compare results across all their experiments.
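The indirect comparison through a common reference is a subtraction on the log scale. The following minimal sketch assumes log2 ratios that have already been normalised; the function name is hypothetical.

```python
import math

def indirect_log_ratio(m_a_vs_ref: float, m_b_vs_ref: float) -> float:
    """log2(A/B) obtained from log2(A/Ref) and log2(B/Ref) on two separate chips."""
    # The reference term cancels: (log A - log Ref) - (log B - log Ref).
    return m_a_vs_ref - m_b_vs_ref

# A gene at 4x the reference in sample A and 0.5x the reference in sample B.
m = indirect_log_ratio(math.log2(4), math.log2(0.5))
print(m, 2 ** m)
```

Because the estimate combines two chips, its error variance is the sum of the two chips' error variances (assuming independent errors), which is why the reference design is less efficient than a direct comparison on one chip.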
One complication in two-color arrays is that the two dyes don’t get taken up equally well, so that the amount of label per amount of RNA differs (dye bias). An early proposal to compensate for dye bias was to make duplicate hybridizations with the same samples using the opposite labeling scheme. For example, to compare two samples: A & B, make two arrays (or an even number), and hybridize them as follows:
Array 1: A vs. B; Array 2: B vs. A
The intent was to compensate for dye bias by averaging ratios from dye-swapped hybridizations. However, dye bias is not consistent, and in practice the ratios in dye-swap experiments don’t precisely compensate each other. Normalization methods such as lowess give more consistent results, although dye-swapping makes it easier to compensate for dye bias. Nevertheless, the dye-swap is the basis for most other efficient designs: the general principles of a good two-color design are that
  • it should be balanced: every sample appears equally often in red and green;
  • the samples whose ratios are most interesting should appear on the same chips most often.
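The balance principle is easy to check mechanically. In this sketch a design is represented as a list of `(red_sample, green_sample)` pairs, one per chip; that representation and the function names are assumptions made for illustration.

```python
from collections import Counter

def dye_counts(chips):
    """Map each sample to (times labelled red, times labelled green)."""
    red = Counter(r for r, _ in chips)
    green = Counter(g for _, g in chips)
    samples = set(red) | set(green)
    # Counter returns 0 for samples never seen with that dye.
    return {s: (red[s], green[s]) for s in samples}

def is_dye_balanced(chips):
    """True if every sample appears equally often in red and in green."""
    return all(r == g for r, g in dye_counts(chips).values())

dye_swap = [("A", "B"), ("B", "A")]
reference = [("A", "Ref"), ("B", "Ref"), ("C", "Ref")]
print(is_dye_balanced(dye_swap))
print(is_dye_balanced(reference))
```

A dye-swap pair passes the check; a plain reference design with the reference always in the same channel does not, unless each chip is repeated with the dyes swapped.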
For comparing a number of samples of equal interest and high quality, a design that utilizes a large number of direct sample-to-sample comparisons is most accurate for the cost, from a theoretical perspective. The simplest of these is a ‘loop’ design: each sample is hybridized to each of two different samples in two different dye orientations. This design results in half the variance per estimate, because each sample occurs twice, rather than once; at the cost of only one more chip. The drawback is that if one chip fails, or is of poor quality, then the error variance for all estimates is doubled. This problem is so serious in practice that many microarray statisticians do not recommend the loop design.
Figure 3. A loop design: arrows represent chips with samples labelled as indicated.
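A loop design can be written down directly from an ordered list of samples. The sketch below uses the same hypothetical `(red, green)` pair representation for a chip; it only generates the layout and does not model chip failure.

```python
def loop_design(samples):
    """Return chips as (red, green) pairs forming one closed loop."""
    n = len(samples)
    # Each sample is red on the chip with its successor
    # and green on the chip with its predecessor.
    return [(samples[i], samples[(i + 1) % n]) for i in range(n)]

chips = loop_design(["A", "B", "C", "D"])
print(chips)

# Every sample appears on exactly two chips, once in each dye.
for s in "ABCD":
    reds = sum(r == s for r, _ in chips)
    greens = sum(g == s for _, g in chips)
    print(s, reds, greens)
```

With only two samples the loop reduces to a dye-swap pair. Removing any one chip from the list turns the loop into an open chain, so every pair of samples is then connected by a single path.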
There are many efficient direct designs, which are also robust to failure, based on ‘round-robin’ style contrasts where each sample is hybridised to a specific subset of all the others, in a balanced fashion. These designs are appropriate where differences between any pair of samples are all equally important, and the experimenter does not plan to compare the expression values directly with other experiments in a longer series. The simplest of these designs for a small number of samples, is a ‘saturated’ design: to hybridise every contrast exactly once. It is fairly easy to balance the dyes with three or five samples; with four or six samples, it is not possible to exactly balance the number of times each sample is labelled red and green.
Figure 4. A saturated design
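The parity argument behind the balance remark can be checked with a small construction. In a saturated design each sample appears on n - 1 chips, which splits evenly between the dyes only when n is odd. The sketch below uses one possible circular assignment of dyes (an illustrative choice, not necessarily the layout in the figure), with chips again written as hypothetical `(red, green)` pairs.

```python
from collections import Counter

def saturated_design(samples):
    """Hybridise every pair of samples exactly once, dyes assigned circularly."""
    n = len(samples)
    chips = []
    for i in range(n):
        for k in range(1, n // 2 + 1):
            # For even n the "opposite" pair would be generated twice; keep one copy.
            if n % 2 == 0 and k == n // 2 and i >= n // 2:
                continue
            chips.append((samples[i], samples[(i + k) % n]))
    return chips

def max_dye_imbalance(chips):
    """Largest |red count - green count| over all samples."""
    red = Counter(r for r, _ in chips)
    green = Counter(g for _, g in chips)
    return max(abs(red[s] - green[s]) for s in set(red) | set(green))

for n in (3, 4, 5, 6):
    chips = saturated_design([f"S{i}" for i in range(n)])
    print(n, len(chips), max_dye_imbalance(chips))
```

For three or five samples the imbalance is zero; for four or six samples every sample is off by one, which is the best that an odd number of appearances allows.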
A more common situation is that some contrasts are more important than others. For example, to investigate the role of a receptor, one prepares wild-type (WT) and mutant (e.g. knockout, KO) animals, and then administers a treatment (e.g. a ligand) to half of each group, while giving a non-effective vehicle to the other half. Then there are four groups, and the contrasts of most interest are the effects of the ligand on the two groups (WT and KO); the contrast between WT and KO animals in each treatment group is less important, and the contrasts between WT treated and KO control, and vice versa, are uninteresting. A good design for this is to hybridise several dye-swap pairs between the treatment and control within each group, and perhaps to hybridise one or two slides between WT treated and KO treated, and between WT control and KO control. This design gives fairly accurate estimates of both effects of treatment vs. control (in WT and KO), which enables accurate comparisons between the effects. There is less accurate information about the direct comparison between WT and KO, although in effect there is more than one slide’s worth of information, because there are several indirect paths for making the comparison between, for example, WT treated and KO treated.
Figure 5. A design for a comparative study of the effect of a treatment on two biological strains
A good general framework for estimating the abundances, and contrasts, is provided by using a linear statistical model. For example, the log ratios in figure 5 are each the difference of the log abundances in two of the four samples; we may construct a design matrix which specifies how the log ratios in all ten experiments are derived from the log abundances in the four samples; then the best estimate of the log abundances is obtained by solving the least squares problem for this design matrix. Since the equations in the design matrix are all differences, adding a constant to any one solution will give another solution; therefore an additional equation specifying that the sum of the log abundances is 0 (or some other constant) is needed to obtain a unique answer. The design matrix for figure 5 is:
Figure 6. The design matrix for the experiment in figure 5; the last line is added to ensure a single solution to the equations.
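The least squares step can be sketched with NumPy. The chip layout below is hypothetical: ten chips over four groups, chosen to follow the verbal description of the comparative design rather than copied from the figure, and the "observed" log ratios are generated without noise from made-up log abundances so that the recovery can be checked. Dye bias is assumed to have been removed by normalisation.

```python
import numpy as np

samples = ["WT_ctl", "WT_trt", "KO_ctl", "KO_trt"]
# Each chip is (red_sample, green_sample); this layout is an assumption.
chips = [
    ("WT_trt", "WT_ctl"), ("WT_ctl", "WT_trt"),
    ("WT_trt", "WT_ctl"), ("WT_ctl", "WT_trt"),
    ("KO_trt", "KO_ctl"), ("KO_ctl", "KO_trt"),
    ("KO_trt", "KO_ctl"), ("KO_ctl", "KO_trt"),
    ("WT_trt", "KO_trt"), ("KO_ctl", "WT_ctl"),
]

def design_matrix(chips, samples):
    """One row per chip (+1 red, -1 green) plus a final sum-to-zero row."""
    index = {s: i for i, s in enumerate(samples)}
    X = np.zeros((len(chips) + 1, len(samples)))
    for row, (red, green) in enumerate(chips):
        X[row, index[red]] = 1.0
        X[row, index[green]] = -1.0
    X[-1, :] = 1.0  # constraint row: log abundances sum to zero
    return X

X = design_matrix(chips, samples)

# Made-up log abundances for one gene (they sum to zero), used to simulate data.
true_log_abundance = np.array([-0.5, 1.0, -1.0, 0.5])
log_ratios = X[:-1] @ true_log_abundance   # log(red) - log(green) per chip
y = np.append(log_ratios, 0.0)             # right-hand side of the constraint

estimate, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(samples, estimate.round(3))))
```

Any contrast of interest is then a difference of the estimated log abundances, for example `estimate[1] - estimate[0]` for the treatment effect in WT. With real, noisy log ratios the same call returns the least squares estimate rather than an exact recovery.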
Another approach to estimation selects one contrast that can be written as a sum of several others (e.g. in figure 5, the WT treated – KO treated contrast is in theory equal to the sum of WT treated – WT control plus WT control – KO control plus KO control – KO treated), and then estimates the remaining contrasts from the log ratios observed on the hybridised chips. This approach is implemented in the limma package available through Bioconductor.

Genomics and Pharmacology Facility