High-Throughput GoMiner Process Overview
High-Throughput GoMiner carries out a number of processing steps. Except as noted, these
steps are the same for both the command-line and the web interfaces. The following text
and graphic describe this process flow.
- Configuration: In the command-line version, parameters for running the
program are established by editing a configuration file. In the web version, the user
selects the parameters from the web interface, and the appropriate configuration file
is generated.
- Quality Assurance: The program checks the total- and changed-gene files for various types of
errors, including gene name formatting errors. We recently showed that Excel can
inadvertently alter gene identifiers as a result of default date and floating-point
format conversions. High-Throughput GoMiner protects the user from those errors by scanning
the input identifiers for probable instances of such conversions. The command-line application
will exit with an error code if a problem is detected. The web application will return
a web page describing an error when one is detected.
- GoMiner Execution: GoMiner is invoked through its command-line interface to
generate a gene-category export file that is used for internal processing. For this run, the
total-gene file serves as both the total-gene and the changed-gene input file.
- Random-Gene File Generation for Computing FDR: A set of random-gene files is generated by
sampling the genes in the total-gene file. Each random-gene file contains the same number
of genes as the changed-gene file.
The random-gene files are used for computing the false-discovery rate (FDR).
- Mapping Genes to GO Categories: The genes in the changed- and random-gene files are
joined with the entries in the gene-category export file generated in the GoMiner Execution
step. The Gene Ontology contains categories called "obsolete" and "unknown." To avoid
introducing errors in subsequent statistical computations, we add a processing step in which
genes are removed if they appear only in "obsolete" or "unknown" categories. The net effect
is to expunge those genes from further consideration in both the total- and changed-gene files.
- Result Integration: Reports are generated that integrate the results computed from the
multiple changed-gene files. These integrated reports include estimates of the FDR, files from
which clustered image maps (CIMs) can be generated, and data from external resources,
such as transcription factors.
- Email Notification: In the web version, once the reports are generated,
the user is sent an email message with a link from which to download the results.
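
The Quality Assurance scan described above can be illustrated with a minimal Python sketch. It is not the program's actual implementation: the function name find_suspect_identifiers and both regular expressions are assumptions chosen for illustration, and the real checks may be broader. The sketch flags identifiers that look like the output of a spreadsheet date conversion (for example "1-Sep") or a floating-point conversion (for example "2.31E+13").

```python
import re

# Assumed patterns for illustration only; the checks actually used by
# High-Throughput GoMiner are not specified in this overview.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches values such as "1-Sep" or "2-Mar" (a day number followed by a month).
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})$", re.IGNORECASE)

# Matches values such as "2.31E+13" (scientific notation).
FLOAT_LIKE = re.compile(r"^\d+(?:\.\d+)?E[+-]?\d+$", re.IGNORECASE)


def find_suspect_identifiers(identifiers):
    """Return (line_number, identifier) pairs that look spreadsheet-converted."""
    suspects = []
    for line_no, ident in enumerate(identifiers, start=1):
        ident = ident.strip()
        if DATE_LIKE.match(ident) or FLOAT_LIKE.match(ident):
            suspects.append((line_no, ident))
    return suspects
```

A command-line wrapper could print the returned pairs and exit with a nonzero status when the list is not empty, which mirrors the behavior described for the command-line application.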
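
The random-gene generation and the removal of genes that appear only in "obsolete" or "unknown" categories can be sketched in the same way. Several details here are assumptions rather than documented behavior: the function names are hypothetical, sampling is done without replacement, the gene-category mapping is held as a dictionary from gene name to a list of category names, and a category is treated as excluded when its name contains "obsolete" or "unknown".

```python
import random

# Assumed markers; the exact category labels used by the program may differ.
EXCLUDED_MARKERS = ("obsolete", "unknown")


def is_excluded(category):
    """True if the category name contains one of the assumed markers."""
    name = category.lower()
    return any(marker in name for marker in EXCLUDED_MARKERS)


def drop_uninformative_genes(gene_to_categories):
    """Keep only genes that map to at least one non-excluded category."""
    return {
        gene: categories
        for gene, categories in gene_to_categories.items()
        if any(not is_excluded(c) for c in categories)
    }


def make_random_gene_sets(total_genes, changed_count, n_sets, seed=None):
    """Draw n_sets random gene lists, each of size changed_count.

    Sampling without replacement from the de-duplicated total-gene list
    is an assumption of this sketch.
    """
    rng = random.Random(seed)
    pool = sorted(set(total_genes))
    return [rng.sample(pool, changed_count) for _ in range(n_sets)]
```

Each returned list would be written out as one random-gene file and processed exactly like a changed-gene file, so that the results from the random sets can serve as the baseline for the FDR estimate.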
A Graphical View of the Process Flow