Introduction
SarcomaCellMinerCDB is an interactive web application that simplifies access to and exploration of sarcoma cell line pharmacogenomic data across different sources (see the Metadata section for more details). Navigation in the application is done using the main menu tabs (see figure below): Univariate Analyses, Multivariate Analysis, Mutation variants, Metadata, Search, Help and Video tutorial. Univariate Analyses is selected by default when entering the site. Each option includes a side bar menu (to choose input) and a user interface output to display results. Analysis options are available at the top for both the Univariate Analyses and Multivariate Analysis tabs (see sub-menu on figure). The result of the first sub-menu option is displayed by default (Figure 1).

Figure 1: Main application interface
Univariate Analyses
Molecular and/or drug response patterns across sets of cell lines can be compared to look for possible associations. The univariate analysis panel includes 4 options: Plot Data, View Data, Compare Patterns and Tissue Correlation. Almost all options have the same input data in the left side panel.
-  The x-axis data choices include 4 fields to be filled by the user:
- x-Axis Cell Line Set selects the data source. The user can choose: NCI Sarcoma, Global Sarcoma, CCLE, GDSC, CTRP, Achilles or MD Anderson (see Data Sources for more details).
- x-Axis Data Type selects the data type to query. The options vary depending on the source selected above, and appear in the x-Axis Data Type dropdown. See the Metadata tab for descriptions and abbreviations.
- Identifier selects the identifier of interest for the data type selected above. For instance, if drug activity for the NCI Sarcoma is selected, the user can enter a single drug name or drug ID (NSC number). The Search IDs tab can be used to explore potential identifiers interactively or to download datasets of interest.
- x-Axis Range allows the user to control the x-axis range for better visualization.
 
 
 
-  The y-axis data choices are as explained above for the x-axis.
 
 
-  Selected tissues: by default, all tissues are selected and included in the scatter plot. To include or exclude cell lines from specific tissues, the user should specify:
- Select Tissues to include or exclude specific tissues.
- Select Tissues of Origin Subset/s functionality at the bottom of the left-hand panel. The tissues of origin are organized as a tree and are all selected by default. To select a specific tissue, the user should click on the triangle icon at the root of the tree and keep expanding it until reaching the desired subtree or leaf. The selection is finalized by clicking on the leaf label. More than one tissue of origin may be selected using the “command” key on Macs or the “control” key on PCs. All cell lines were mapped to the four-level OncoTree cancer tissue type hierarchy developed at Memorial Sloan-Kettering Cancer Center. In the CellMinerCDB application, a tissue value is coded as an OncoTree node that can include elements from level 1 to level 4 separated by the “:” character.
- Tissues to Color to locate cell lines related to desired tissues within the scatter plot. By default, the cell lines are colored by their pre-assigned OncoTree cancer tissue level 1 color. The user can now select up to 4 tissues with different colors (red, green, dark blue and orange); the remaining cell lines are colored in light blue. The Show Color checkbox must be active.
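As an illustration of the “:”-separated tissue coding described above, the following minimal Python sketch splits a tissue value into its OncoTree levels and checks whether a cell line falls under a selected node. The function names and the example tissue strings are hypothetical; the application's actual tissue labels and internal code are not shown in this help page.

```python
def parse_oncotree(node):
    """Split a tissue value such as 'a:b:c' into OncoTree levels 1 to 4."""
    levels = [part.strip() for part in node.split(":")]
    if not 1 <= len(levels) <= 4 or any(not part for part in levels):
        raise ValueError(f"expected 1 to 4 non-empty levels, got: {node!r}")
    # Pad missing deeper levels with None so every record has four slots.
    levels += [None] * (4 - len(levels))
    return {f"level{i}": value for i, value in enumerate(levels, start=1)}


def matches_tissue(node, selected):
    """True if `node` equals `selected` or lies beneath it in the tree."""
    return node == selected or node.startswith(selected + ":")


# Hypothetical tissue values, for illustration only.
print(parse_oncotree("bone:ewing_sarcoma"))
print(matches_tissue("bone:ewing_sarcoma", "bone"))
```

Matching on the prefix followed by “:” avoids treating a node such as “bone_marrow” as a child of “bone”.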
 
 
 
Plot Data
Any pair of features from different sources across common cell lines can be plotted (as a scatterplot) including the resultant Pearson correlation and p-value. The p-value estimates assume multivariate normal data, and are less reliable as the data deviate from this. Please use the scatter plot to check the data distribution (e.g., for outlying points outside of a more elliptically concentrated set).
The icons at the top of the plot provide the following options, from left to right:
- Download the plot as a PNG.
- Zoom in on an area of interest by clicking and dragging with the pointer.
- Autoscale the image.
- Create horizontal and vertical lines from either a cell line dot or the regression line by hovering over them.

Figure 2: An example scatterplot of SLFN11 gene expression (x-axis) versus Topotecan drug activity (y-axis), both from the NCI Sarcoma. The Pearson correlation value and p-value appear at the top of the plot. A linear fitting curve is included. This is an interactive plot: whenever the user changes any input value, the plot is updated. Hovering over any point in the plot shows additional information about the cell line, tissue, OncoTree designation, and x and y coordinate values.
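For readers who want to reproduce the correlation value outside the application, the sketch below computes a Pearson correlation over the cell lines that have both values, together with the usual t statistic with n - 2 degrees of freedom. It is a plain-Python illustration, not the application's code; the function names are hypothetical, and the choice of test statistic is an assumption about how the reported p-value is derived.

```python
import math


def pearson_r(x, y):
    """Return (r, n) over the pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least 3 complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy), n


def t_statistic(r, n):
    """t statistic for the null hypothesis of zero correlation (n - 2 df)."""
    denom = 1.0 - r * r
    if denom <= 0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / denom)


# Made-up values; None marks a cell line missing from one source.
r, n = pearson_r([1, 2, 3, 4, 5, None], [2, 1, 4, 3, 5, 7])
print(r, n, t_statistic(r, n))
```

Turning the t statistic into a p-value requires the Student t distribution, which is not in the Python standard library; a statistics package such as SciPy provides it.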
View Data
This option displays the data selected in the Plot Data tab in tabular form and provides a Download selected x and y axis data as Tab-Delimited File option. The user can change the input data in the left selection panel as described for Plot Data. The displayed table includes the cell line, the x-axis value, the y-axis value, the tissue of origin and the 4 OncoTree levels. In the header, the selected features are prefixed by the data type abbreviation and suffixed by the data source.

Figure 3: Shows the selected values for SLFN11 gene expression (x-axis) and Topotecan (id 609699) drug activity (y-axis) from the NCI Sarcoma across all common lines. The features are coded as expSLFN11_uniSarcoma and act609699_uniSarcoma where “exp” and “act” represent respectively prefixes for microarray gene expression and drug activity.
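The column naming convention described above (data type prefix, then identifier, then an underscore and the data source) can be sketched as follows. The helper names are hypothetical, and the list of known prefixes is limited to the three abbreviations mentioned in this help page (“exp”, “act” and “mut”); the Metadata tab lists the full set.

```python
def feature_name(prefix, identifier, source):
    """Build a column name such as 'expSLFN11_uniSarcoma'."""
    return f"{prefix}{identifier}_{source}"


def split_feature_name(name, known_prefixes=("exp", "act", "mut")):
    """Split a column name back into (prefix, identifier, source)."""
    # The source is whatever follows the last underscore.
    body, _, source = name.rpartition("_")
    if not body:
        raise ValueError(f"no '_<source>' suffix in {name!r}")
    for prefix in known_prefixes:
        if body.startswith(prefix):
            return prefix, body[len(prefix):], source
    raise ValueError(f"unknown data type prefix in {name!r}")


print(feature_name("exp", "SLFN11", "uniSarcoma"))
print(split_feature_name("act609699_uniSarcoma"))
```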
Compare Patterns
This option computes the correlation between the selected feature, as defined by the specified x-Axis Cell Line Set, x-Axis Data Type, and Identifier, and either all drug or all molecular data from the (same) x-Axis or y-Axis source. By default, all tissues are selected; however, the user can restrict the analysis to specific tissues of origin.
Pearson’s correlations are provided in tabular form, with reported p-values (not adjusted for multiple comparisons). The displayed features are ordered by level of correlation and include the target pathway for genes and the mechanism of action (MOA) for drugs (if available).

Figure 4: Shows correlation results for SLFN11 gene with all other molecular features for all NCI Sarcoma datasets sorted by correlation value with gene location and target pathways (annotation field).
Tissue Correlation
This option displays, for each tissue of origin (OncoTree level 1), the number of cell lines with complete observations (non-missing values), the correlation between the selected paired features, and its p-value.

Figure 5: Shows the correlation between the selected values for SLFN11 gene expression (x-axis) and Topotecan (id 609699) drug activity (y-axis) from the NCI Sarcoma across all common lines by tissue of origin. Note: The value “ALL” means all available common tissues between the 2 selected features.
Multivariate Analysis
The ‘Multivariate Analysis’ option (or module) has multiple tabs, including Heatmap, Data, Plot, Cross-Validation, Technical Details and Partial Correlation (described below), and allows construction and assessment of multivariate linear response prediction models within a single cell line set. For instance, we can assess prediction of a drug activity based on the expression of selected genes. To construct a regression model, you first need to specify the input data in the left side panel.
- The response variable is chosen by selecting:
- Response Cell Line Set selects the data source for the response variable. The user can choose: NCI Sarcoma, CCLE, GDSC or CTRP (see the Data Sources section of Help for more details on these Cell Line Sets). 
- Response Data Type selects the data type for the response variable (example: a drug or a molecular dataset). The options vary depending on the source selected above, and appear in the Response Data Type dropdown. See the Metadata tab for data type descriptions.
- Response Identifier selects the identifier for the response variable (e.g., a specific drug or gene identifier).
 
 
 
- The predictor variables are chosen by selecting:
- Predictor Cell Line Set selects the data source for the predictor variable. The user can choose: NCI Sarcoma, CCLE, GDSC or CTRP.
- Predictor Data Type/s selects the data type(s) for the predictor variables. Use the command key on Macs or the control key on PCs to select more than one dataset.
- Minimal Predictor Range provides a required minimum value for the identifier to be included for the first listed data type. The default is 0. One may increase this value to eliminate predictors that are considered to have insufficient range to be biologically meaningful.
- Predictor Identifiers selects the identifiers for the predictors. When using the Linear Regression algorithm, predictors must be entered. In Figure 6, we explore linear model prediction of Topotecan drug activity in the NCI Sarcoma, choosing SLFN11 and BPTF gene expression. Identifiers from different sources may be combined using 2 methods. In the first, select multiple Data Types as desired, and enter your identifiers. The model will be built automatically using those Data Types and Identifiers. For example, if expression and mutation are selected as Data Types and SLFN11 and BPTF are entered as identifiers, the model will be built using 4 identifiers: expSLFN11, expBPTF, mutSLFN11 and mutBPTF. In the second, more specific approach, you enter the identifier with the data type prefix. For example, if your predictor variables are specifically the expression value for SLFN11 and the mutation value for BPTF, then you can enter as identifiers: expSLFN11 and mutBPTF. When using the Lasso algorithm, predictors are optional (see point 4), since Lasso automatically identifies the ones that best fit the model.
 
 
 
- Select Tissue/s of Origin is used to include or exclude specific tissues, as defined in the next step. By default, all tissue types are included; however, you can select one or more tissue types (to include or exclude). Use the radio buttons To include or To exclude to choose whether the selected tissues are included or excluded. To select multiple tissues, use the “command” key on Macs or the “control” key on PCs.
 
 
- Algorithm: by default, the Linear Regression model is selected; however, you can also select the Lasso model, a penalized linear regression machine learning approach. Linear regression is a linear approach to modeling the relationship between a response (or dependent variable) and one or more predictor variables (or independent variables). It is implemented using the R stats package lm() function. Lasso (least absolute shrinkage and selection operator) is a penalized linear regression model that performs both variable selection and linear model coefficient fitting. It is implemented using the cv.glmnet function (R package glmnet). The lasso lambda parameter controls the tradeoff between model fit and variable set size, and is set to the value giving the minimum error with 10-fold cross-validation. The random seed is set to 1 (set.seed) and alpha is set to 1. The minimum lambda is used to select the intercept and the coefficient for each variable (there is no range). For both algorithms, 10-fold cross-validation is applied to fit model coefficients and predict response while withholding portions of the data, to better estimate robustness. For further details on either of these outputs, see the respective R packages. If the Lasso algorithm is selected, you have to specify:
- Select Gene Sets: The gene selection is based on curated gene sets such as DNA Damage Repair DDR or Apoptosis. The user can select one or more gene sets.
- Maximum Number of Predictors allows choice of the number of predictors (default 4)
 
Once all the above information is entered, a regression model is built and the results are shown in different ways, such as the technical details of the model, plots of observed vs. predicted responses, or a heatmap of the variables. The different outputs of the regression model module are explained below.
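The two ways of entering predictor identifiers described above can be sketched in a few lines of Python. This is only an illustration of the naming logic: the function name is hypothetical, the prefix list is limited to the abbreviations mentioned in this help page, and the sketch assumes that bare identifiers never start with one of these lowercase prefixes.

```python
from itertools import product


def resolve_predictors(selected_prefixes, entries,
                       known_prefixes=("exp", "mut", "act")):
    """Turn the user's entries into prefixed predictor names.

    Bare identifiers (method 1) are paired with every selected data type;
    identifiers that already carry a prefix (method 2) are kept as entered.
    """
    explicit = [e for e in entries if e.startswith(tuple(known_prefixes))]
    bare = [e for e in entries if not e.startswith(tuple(known_prefixes))]
    expanded = [f"{p}{g}" for p, g in product(selected_prefixes, bare)]
    return expanded + explicit


# Method 1: two data types and two bare identifiers give four predictors.
print(resolve_predictors(["exp", "mut"], ["SLFN11", "BPTF"]))
# Method 2: prefixed identifiers are used exactly as entered.
print(resolve_predictors(["exp", "mut"], ["expSLFN11", "mutBPTF"]))
```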
Heatmap
This option provides the observed response and predictor variables across all source cell lines as an interactive heatmap. For the heatmap visualization, data are range standardized (subtract the minimum, and divide by the range) to values between 0 and 1, based on the value range within all rows of a given data type (by default) or within each row of data (if ‘Use Row Color Scale’ is selected). For data types other than mutation data, the range is trimmed to the difference between the 95th and 5th percentiles; values below or above the 5th and 95th percentile values are scaled to 0 and 1, respectively. In the case of mutation data, the range used for scaling is the difference between the maximum and minimum values. If the values within a data type (or data row if ‘Use Row Color Scale’ is selected) are constant, the scaled value for heatmap visualization is set to 0.5.
The user can restrict the number of cell lines to those that have the highest or lowest response values by selecting Number of High/Low Response Lines to Display. The user can download the heatmap related data by clicking on Download Heatmap Data.
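The range standardization used for the heatmap colors can be illustrated with the following Python sketch. It is not the application's code: the function names are hypothetical, the percentile is computed by linear interpolation (an assumption, since the exact percentile method is not stated), and a zero trimmed range is treated the same way as constant data.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 1]) of a pre-sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)  # floor, since pos is never negative
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def range_standardize(values, trim=True):
    """Scale values to [0, 1]; None entries (missing data) are kept as None.

    trim=True uses the 5th and 95th percentiles as the range limits
    (non-mutation data); trim=False uses the minimum and maximum
    (mutation data).
    """
    present = sorted(v for v in values if v is not None)
    if not present:
        return [None] * len(values)
    low = percentile(present, 0.05) if trim else present[0]
    high = percentile(present, 0.95) if trim else present[-1]
    if high == low:
        # Constant data (or zero trimmed range): use the mid-scale color.
        return [None if v is None else 0.5 for v in values]
    return [
        None if v is None else min(1.0, max(0.0, (v - low) / (high - low)))
        for v in values
    ]


print(range_standardize([0, 5, 10], trim=False))
print(range_standardize([3, 3, 3]))
```

Passing all values of a data type gives the default color scale, while passing one row at a time corresponds to the ‘Use Row Color Scale’ behavior.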

Figure 6: An example heatmap where we selected topotecan as a response variable and SLFN11 and BPTF gene expression as predictor variables. In this example, we chose to display only the 60 cell lines that have the 30 highest and 30 lowest values for topotecan activity.
If the Lasso algorithm is selected (see below), more predictor variables are shown (PSN2, SMARCD1, DFFB and ARID1A).

Figure 7: Same example as previous figure with the lasso algorithm
Data
This option shows the detailed data for the model variables for each cell line. Both the 10-fold cross-validation (CV) predictions and the predicted responses are given. The data are displayed as a table with filtering options for each column.

Figure 8: Data related to the simple linear regression model presented in the previous section.
Plot
This option enables one to plot and compare the observed response values (y-axis) versus the predicted response values (x-axis). The predicted response values are derived from a linear regression model fit to the full data set.

Figure 9: Plot comparing Topotecan observed vs. predicted activity with correlation value of 0.59
Cross-Validation
This option enables plotting the observed response values (y-axis) versus the 10-fold cross-validation predicted response values (x-axis). With this approach, the predicted response values are obtained (over 10 iterations) by successively holding out 10% of the cell lines and predicting their response using a linear regression model fit to the remaining 90% of the data. After all 10 folds have been done, each sample has one cross-validated prediction (since each sample gets in the test set once). We compute the correlation between these cross-validated predictions and the true responses.
Cross-validation is widely used in statistics to assess model generalization to independent data – with the caveat that the independent data must still share the same essential structure (i.e., probability distribution) as the training data. It can also indicate possible overfitting of the training data, such as when the observed versus full data set model-predicted correlation (shown in ‘Plot’) is substantially better than the observed versus cross-validation predicted correlation (shown in ‘Cross-Validation’).
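The hold-out procedure described above can be sketched in plain Python. To stay short, the sketch uses a single predictor and an ordinary least-squares line, whereas the application fits multivariate models in R; the function names, the fold assignment and the seed handling are assumptions and will not reproduce the application's exact folds.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def cross_val_predict(xs, ys, k=10, seed=1):
    """Return one held-out prediction per sample using k folds."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    preds = [None] * len(xs)
    for fold in range(k):
        test = order[fold::k]  # roughly 1/k of the samples
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = b0 + b1 * xs[i]
    return preds


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


# Made-up data: 30 samples following an exact line.
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 for x in xs]
cv_preds = cross_val_predict(xs, ys)
print(pearson(cv_preds, ys))
```

Because every sample is in the test fold exactly once, the resulting list has one cross-validated prediction per cell line, which is what is correlated with the observed response.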

Figure 10: Plot comparing Topotecan observed vs. cross-validation predicted activity with correlation value of 0.51
Technical Details
This option enables the user to view the R statistical and other technical details related to the predicted response model. To save these results, copy and paste them into the document or spreadsheet of your choice.

Figure 11: Example of regular regression model fitting results
Partial correlations
This function is used to identify additional predictive variables for a multivariate linear model. Conceptually, the aim is to identify additional predictive variables that are independently correlated with the response variable, after accounting for the influence of the existing predictor set. Computationally, a linear model is fit, with respect to the existing predictor set, for both the response variable and each candidate predictor variable. The partial correlation is then computed as the Pearson’s correlation between the resulting pairs of model residual vectors (which capture the variation not explained by the existing predictor set). The p-values reported for the correlation and linear modeling analyses assume multivariate normal data. The two-variable plot feature of CellMinerCDB allows informal assessment of this assumption, with clear indication of outlying observations. The reported p-values are less reliable as the data deviate from multivariate normality.
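The residual-based computation described above can be illustrated with a short Python sketch. For brevity it assumes the existing predictor set contains a single variable, so each linear model is a simple least-squares line; the function names and the numbers are made up for illustration.

```python
import math


def residuals(target, predictor):
    """Residuals of a least-squares line of `target` on `predictor`."""
    n = len(target)
    mp, mt = sum(predictor) / n, sum(target) / n
    spp = sum((p - mp) ** 2 for p in predictor)
    slope = sum((p - mp) * (t - mt) for p, t in zip(predictor, target)) / spp
    intercept = mt - slope * mp
    return [t - (intercept + slope * p) for p, t in zip(predictor, target)]


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def partial_correlation(response, candidate, existing):
    """Correlation of response and candidate after removing `existing`."""
    return pearson(residuals(response, existing),
                   residuals(candidate, existing))


# Made-up values for six cell lines.
existing = [1, 2, 3, 4, 5, 6]    # predictor already in the model
candidate = [2, 1, 4, 3, 6, 5]   # candidate additional predictor
response = [1, 3, 2, 5, 4, 7]    # response variable
print(partial_correlation(response, candidate, existing))
```

With a single existing predictor, the result agrees with the textbook formula that combines the three pairwise correlations, which is a convenient check on the residual approach.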
In order to run a partial correlation analysis, the user should first construct a linear model (providing response and predictor variables as explained earlier; steps 1 to 4 in the figure below) and then:
- Select Gene Sets: The gene selection is based on curated gene sets. Here the user can select one or more gene sets, or even all genes (step 5 in figure below).
- Select Data types: the user can select one or more data types, such as gene expression, methylation or copy number variation (step 6 in figure below).
- Optionally, specify the Minimum Range for the first listed data type (step 7 in figure below).
- Finally, click on the Run button (step 8 in figure below).

Figure 12: An example of  partial correlation results for selected gene expression data using all gene sets.
Exploratory workflow
Multiple data analysis workflows may be used, depending on the question being asked. A typical workflow:
-  Check the relationship between two variables [2D plot]. Example: SLFN11 transcript expression and topotecan drug activity.
 
-  Examine what else might be associated with either the x-axis or y-axis variable [Pattern Comparison]. Example: considering potential biological effects, TGFBR3 (an apoptosis factor) and BPTF (a chromatin factor) transcript expression might be considered candidates for affecting topotecan activity.
 
-  Upon finding two or more associations with a single 'response' variable through [Pattern Comparison/2D Plot], check if they complement one another in a multivariate model [Regression Models]. Example: Starting with the dominant SLFN11, adding TGFBR3 does not add to the regression model, but BPTF does.
 
-  Repeat the above steps as needed.
Mutation Variants
This option enables querying mutation variants per gene for all cell lines. For each variant we provide its location (chromosome, start and end positions), its variant allele frequency (VAF), mutation type and amino acid change. First the user should specify the Cell Line Set or data source to view all available associated data types. Then the user enters the gene symbol. Once the variants table is displayed, the user can scroll it horizontally for more variant information.

Figure 13: Shows all mutation variants for SLFN11 in NCI Sarcoma cell lines
Metadata
This option enumerates, for each cell line set, the available data types that can be queried within the app, providing the data type abbreviation or prefix, description, feature value unit (z-score, intensity, probability …), platform or experiment and related publication reference (PubMed). First the user should specify the Cell Line Set or data source to view all available associated data types. The user can then download data by choosing Select Data Type to Download and clicking on Download Data type and/or Download Data Footnotes to download any data or footnotes for the selected cell line set. Finally, the user can download the current cell line set information and the drug synonyms table with matching IDs for all cell line sets by clicking on Download cell line annotation and Download table, respectively.

Figure 14: Shows all data types for NCI Sarcoma
Search IDs
This page lists the identifiers (ID) available in the selected data source for use in the Univariate Analyses or Multivariate Analysis. The user chooses:
- Cell Line Set selects the data source. The user can choose: NCI Sarcoma, CCLE, GDSC, CTRP… (see Data Sources for more details).
- Select Data Type selects the data type to query. The options vary depending on the source selected above, and appear in the Select Data Type dropdown. See the Metadata tab for descriptions and abbreviations.
This makes it possible to search all related IDs for each combination. For the molecular data, the gene names (ID) and specific data type information are provided. For the drugs and compounds, the identifier (ID), Drug name (when available), Drug MOA and CLINICAL STATUS (when available) are displayed. The user can scroll down the whole list of IDs, or search for specific ID(s) by entering a value in the header of any column.
Drug IDs
For the NCI Sarcoma, the drug identifiers (ID) are NSC numbers or names. For the CCLE, GDSC, and CTRP, the drug identifiers are the drug names.

Figure 15: Example of a search: if looking for a drug ID in the NCI Sarcoma, select “NCI Sarcoma” as the cell line source and “Drug Activity” as the data type. You can type in the search box of the “Drug name”, “MOA” or “CLINICAL.STATUS” column.
Gene IDs
For all data sources, the gene ID is the HUGO gene symbol; however, the application also recognizes any synonym or previous symbol (alias) that is included in the HUGO database.

Figure 16: Example of a search: if looking for a gene ID in the NCI Sarcoma, select “NCI Sarcoma” as the cell line source and “gene expression” as the data type. You can type in the search box of the “gene name”, “entrez gene id” or “Chromosome” column…
Navigation guide
Multiple selection
To select multiple choices from a list, hold the “command” key on Macs or the “control” key on PCs and then click.
X-axis or Y-axis range
You can change the x-axis or y-axis lower or higher value to have different views of the displayed plot.
Show color
This checkbox enables or disables colors in the scatter plots.
Please send comments and feedback to 
- fathi.elloumi AT nih.gov 
- aluna AT jimmy.harvard.edu 
- vinodh.rajapakse AT nih.gov
Data Sources
CellMinerCDB integrates data from the following sources, which provide additional data and specialized analyses.
 Figure 18: Cell line overlaps between data sources.
About the Data
For specific information about the data made available for particular sources, please refer to the 'Metadata' navbar tab.
Drug mechanism of action details:
Gene sets used for annotation of analysis results or algorithm input filtering were curated by the NCI/DTB CellMiner team, based on surveys of the applicable research literature.
Release history
March 2021: release v1.0
- New NCI Sarcoma datasets
- CCLE, GDSC, CTRP, MD Anderson and Achilles Sarcoma specific datasets
About CellMinerCDB
The Sarcoma CellMinerCDB application is developed and maintained using R and Shiny by:
- Fathi Elloumi; Bioinformatics Software Engineer, Developmental Therapeutics Branch, National Cancer Institute
- Augustin Luna; Research Fellow, Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard Medical School
- Vinodh N. Rajapakse; Postdoctoral Fellow, Developmental Therapeutics Branch, National Cancer Institute
- William C. Reinhold
- Sudhir Varma
- Margot Sunshine
- Fathi Elloumi
- Lisa Loman (Special Volunteer)
- Fabricio G. Sousa
- Kurt W. Kohn
- Yves Pommier
Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard Medical School
MSKCC Computational Biology
- Jianjiong Gao
- Nikolaus Schultz
References
Shankavaram UT, Varma S, Kane D, Sunshine M, Chary KK, Reinhold WC, Pommier Y, Weinstein JN. CellMiner: a relational database and query tool for the NCI-60 cancer cell lines. BMC Genomics. 2009 Jun 23;10:277. doi: 10.1186/1471-2164-10-277.
Reinhold WC, Sunshine M, Liu H, Varma S, Kohn KW, Morris J, Doroshow J, Pommier Y. CellMiner: a web-based suite of genomic and pharmacologic tools to explore transcript and drug patterns in the NCI-60 cell line set. Cancer Res. 2012 Jul 15;72(14):3499-511. doi: 10.1158/0008-5472.CAN-12-1370.
Reinhold WC, Sunshine M, Varma S, Doroshow JH, Pommier Y. Using CellMiner 1.6 for Systems Pharmacology and Genomic Analysis of the NCI-60. Clin Cancer Res. 2015 Sep 1;21(17):3841-52. doi: 10.1158/1078-0432.CCR-15-0335. Epub 2015 Jun 5.
Luna A, Rajapakse VN, Sousa FG, Gao J, Schultz N, Varma S, Reinhold W, Sander C, Pommier Y. rcellminer: exploring molecular profiles and drug response of the NCI-60 cell lines in R. Bioinformatics. 2015 Dec 3. pii: btv701.
Rajapakse VN, Luna A, Yamade M, Loman L, Varma S, Sunshine M, Iorio F, Elloumi F, Aladjem MI, Thomas A, Sander C, Kohn KW, Benes CH, Garnett M, Reinhold WC, Pommier Y. CellMinerCDB for Integrative Cross-Database Genomics and Pharmacogenomics Analyses of Cancer Cell Lines. iScience, Cell Press. 2018 Dec 12.
Reinhold WC, Varma S, Sunshine M, Elloumi F, Ofori-Atta K, Lee S, Trepel JB, Meltzer PS, Doroshow JH, Pommier Y. RNA sequencing of the NCI-60: Integration into CellMiner and CellMiner CDB. Cancer Res. 2019 May 21. pii: canres.2047.2018. doi: 10.1158/0008-5472.CAN-18-2047.
Related links