LeFE Build:8 Genomics and Bioinformatics Group

How LeFE Works

LeFE is applied iteratively to each gene category. This picture depicts a toy example that demonstrates how LeFE would score the strength of association between a set of five microarray experiments and a single category with two genes. In reality, the experiment would contain more microarrays, and this process would be repeated for every category.


Figure: A: LeFE's input. B: The gene expression matrix is split into category genes and non-category genes. C: The negative control genes are subsampled from the non-category genes. D: The signature vector and composite matrix are input into a random forest. E: A random forest is trained on the data. F: The random forest's output is an importance value for every gene. G: A permutation t-test determines if the category's genes are more important than the background genes. H: The output of LeFE after it is applied to a single category.
A: There are three inputs into LeFE:
  • (i) A signature vector describes the biological behavior, process, or state to be predicted for each experimental sample. The signature vector either classifies samples (e.g., as normal or diseased) or assigns each sample a continuous value (e.g., relative drug sensitivity).
  • (ii) A matrix of gene expression data measured for each sample.
  • (iii) A custom-defined set of gene categories (not shown in the graphic above).
B: The set of genes is split into category genes (blue) and non-category genes (green).
C: A subset of genes not in the currently analyzed category is selected to serve as negative control genes. The number of negative control genes selected is proportional to the number of genes in the category.
D: The vector of signature values (orange) and a composite matrix, consisting of the category’s genes and the negative control genes, are input into a random forest machine learning algorithm.
E: The random forest is trained to learn the signature vector and assesses the importance of each gene to its trained model. The random forest’s multivariate models consider each gene’s importance within its biological context.
F: The result of training the random forest is a set of gene importance scores, one for every gene input into the random forest.
G: A non-parametric permutation t-test is used to determine whether the genes in the category were deemed more important to the random forest models than the negative control genes. To ensure that the algorithm converges, steps C through G are repeated multiple times, so each repetition draws a new set of negative control genes.
H: The result of the process, run on a single gene category, is (i) the median importance score of each gene in the category, (ii) the category’s median permutation t-test p-value, and (iii) an importance plot that compares the distribution of importance scores for the genes in the category with that of all of the different negative control genes.
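Steps B through H can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not LeFE's actual implementation. All function names, the control-to-category ratio of 6, the number of repetitions, the number of permutations, and the choice of Welch's form of the t statistic are assumptions made for the example. To keep the sketch free of external dependencies, steps E and F are left as a pluggable importance function, and the stand-in used here is the absolute Pearson correlation with the signature vector, which is not a random forest. The importance plot from output (iii) is omitted.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic; each group needs at least two values."""
    def mean_var(x):
        m = math.fsum(x) / len(x)
        return m, math.fsum((v - m) ** 2 for v in x) / (len(x) - 1)
    (ma, va), (mb, vb) = mean_var(a), mean_var(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (ma - mb) / se if se > 0 else 0.0  # degenerate case: no spread at all


def permutation_t_test(category_scores, control_scores, n_perm=1000, rng=None):
    """Step G: one-sided permutation p-value (category more important than controls)."""
    rng = rng or random.Random(0)
    observed = welch_t(category_scores, control_scores)
    pooled = list(category_scores) + list(control_scores)
    k = len(category_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of category vs. control
        if welch_t(pooled[:k], pooled[k:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero


def abs_correlation_importance(composite, signature):
    """Stand-in for steps E and F (NOT a random forest): |Pearson r| per gene."""
    ms = math.fsum(signature) / len(signature)
    ss = math.fsum((s - ms) ** 2 for s in signature)
    scores = {}
    for gene, values in composite.items():
        mv = math.fsum(values) / len(values)
        sv = math.fsum((v - mv) ** 2 for v in values)
        cov = math.fsum((v - mv) * (s - ms) for v, s in zip(values, signature))
        scores[gene] = abs(cov) / math.sqrt(sv * ss) if sv > 0 and ss > 0 else 0.0
    return scores


def score_category(expr, signature, category, importance_fn,
                   ratio=6, n_rounds=20, seed=0):
    """Steps B to H for one category; ratio and n_rounds are assumed values."""
    rng = random.Random(seed)
    cat_genes = [g for g in expr if g in category]        # step B: category genes
    other = [g for g in expr if g not in category]        # step B: the rest
    n_controls = min(len(other), ratio * len(cat_genes))  # proportional to category size
    per_gene = {g: [] for g in cat_genes}
    p_values = []
    for _ in range(n_rounds):                             # repeat steps C through G
        controls = rng.sample(other, n_controls)          # step C: negative controls
        composite = {g: expr[g] for g in cat_genes + controls}  # step D
        importance = importance_fn(composite, signature)  # steps E and F
        for g in cat_genes:
            per_gene[g].append(importance[g])
        p_values.append(permutation_t_test(
            [importance[g] for g in cat_genes],
            [importance[g] for g in controls], rng=rng))  # step G
    # Step H: median importance per category gene and median p-value
    medians = {g: statistics.median(v) for g, v in per_gene.items()}
    return medians, statistics.median(p_values)


# Toy usage: 12 samples, a two-gene category that tracks the signature, 30 noise genes.
toy_rng = random.Random(1)
signature = [float(i) for i in range(12)]
expr = {"cat1": [2.0 * s + 1.0 for s in signature],
        "cat2": [-s for s in signature]}
expr.update({f"bg{i}": [toy_rng.gauss(0, 1) for _ in signature] for i in range(30)})
medians, p_value = score_category(expr, signature, {"cat1", "cat2"},
                                  abs_correlation_importance)
```

Any function that maps the composite matrix and signature vector to one importance score per gene can be passed as importance_fn, so a real random forest implementation could replace the correlation stand-in without changing the rest of the sketch.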

Last Updated: August 16, 2007


LeFE™ is a development of the Genomics and Bioinformatics Group, Laboratory of Molecular Pharmacology (LMP), Center for Cancer Research (CCR), National Cancer Institute (NCI). Please email us with any problems, questions or feedback on the tool.
