#### Basic Steps

- Prepare dataset (One Matrix)
- Prepare dataset (Two Matrix)
- Default values
- Selecting appropriate options
- Understanding the result

#### Prepare Dataset (One Matrix)

The input file must be in **Tab delimited text file format** (saved with a .txt extension).

Example below:

*File Name:* mydata.txt

| | column 1 | column 2 | column 3 |
|---|---|---|---|
| row 1 | -2 | 4 | 5 |
| row 2 | na | 2 | 1 |
| row 3 | 4 | 6 | - |

The file contains the data together with row names and column names. The first column must hold the row names and the first row must hold the column names. The value in the top-left cell (first row, first column; left blank in the example above) is ignored.

Missing values are accepted and should be indicated by 'na' (row 2/column 1 in the example above) or a hyphen (row 3/column 3). Empty cells are also accepted.

We use a period (full stop) as a decimal point. Using a comma will result in errors, since we use commas as list separators. For the same reason, please do not use commas to separate digits in large numbers. For example, numbers should be written as "123456.78" not "123456,78" or "123,456.78" or "123,456,78".

With fewer than 3 rows or columns the clustering algorithm cannot provide any useful information.
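The format rules above can be checked before uploading a file. The following is a minimal sketch, not the tool's own parser: the function name `parse_matrix` is hypothetical, and matching 'na' only in lowercase is an assumption. It reads tab-delimited text, ignores the top-left cell, maps 'na', a hyphen, and empty cells to `None`, and rejects any number containing a comma.

```python
import csv
import io

# Markers the text lists as missing values: empty cell, 'na', or a hyphen.
MISSING = {"", "na", "-"}

def parse_matrix(text):
    """Parse tab-delimited text into (row_names, col_names, values)."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    col_names = rows[0][1:]  # the top-left cell is ignored
    row_names, values = [], []
    for r in rows[1:]:
        row_names.append(r[0])
        # Pad short rows so trailing empty cells count as missing.
        cells = r[1:] + [""] * (len(col_names) - len(r) + 1)
        parsed = []
        for cell in cells:
            cell = cell.strip()
            if cell in MISSING:
                parsed.append(None)
            elif "," in cell:
                raise ValueError(f"comma not allowed in number: {cell!r}")
            else:
                parsed.append(float(cell))
        values.append(parsed)
    return row_names, col_names, values

# The example file from this section, written with explicit tab characters.
sample = (
    "\tcolumn 1\tcolumn 2\tcolumn 3\n"
    "row 1\t-2\t4\t5\n"
    "row 2\tna\t2\t1\n"
    "row 3\t4\t6\t-\n"
)
row_names, col_names, values = parse_matrix(sample)
```

Here `values[1][0]` and `values[2][2]` come back as `None`, matching the 'na' and hyphen cells in the example.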

#### Prepare Dataset (Two Matrix)

Two matrices are used as input in this case, one NxP and another PxM. From these, a third matrix (the product matrix) of size NxM is created, where element (i,j) is the correlation between the ith row of the first matrix and the jth column of the second matrix. A CIM for the product matrix is created, which is colored according to the elements in the various rows and columns.
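As an illustration of how such a product matrix can be built, here is a minimal sketch that assumes Pearson correlation and complete data (no missing values). The function names `pearson` and `product_matrix` are hypothetical, and how the tool itself treats missing or constant vectors is not covered here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def product_matrix(first, second):
    """Correlate each row of `first` (NxP) with each column of `second` (PxM)."""
    p = len(second)
    if any(len(row) != p for row in first):
        raise ValueError("columns of first matrix must equal rows of second")
    m = len(second[0])
    # Extract the M columns of the second matrix as vectors of length P.
    columns = [[second[k][j] for k in range(p)] for j in range(m)]
    return [[pearson(row, col) for col in columns] for row in first]

first = [[1, 2, 3], [3, 2, 1]]      # N=2 rows, P=3 columns
second = [[1, 6], [2, 4], [3, 2]]   # P=3 rows, M=2 columns
product = product_matrix(first, second)  # N x M = 2 x 2
```

In this small case the first row of `first` rises with the first column of `second` and falls against the second column, so the first row of `product` is close to 1 and -1.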

The rows and/or columns of the product matrix can be clustered to bring out patterns, but here the clustering is done based on the rows of the first input matrix and the columns of the second input matrix. The rows are reordered by clustering the rows of the first input matrix and this reordering is used for the product matrix. Similarly the clustering of the columns of the second input matrix gives the reordering of the columns of the product matrix.

The Two Matrix algorithm takes two files as input, the row data file and the column data file.
**The columns of the first data file must match the rows of the second data file in both number and order.**
If the first data file has N rows and P columns, the second data file must have P rows.
In the example below, the columns of the first data file match the rows of the second data file in number (3) and order.

*First data file*

| | column 1 | column 2 | column 3 |
|---|---|---|---|
| row 1 | -2 | 4 | 5 |
| row 2 | na | 2 | 1 |
| row 3 | 4 | 6 | - |
| row 4 | 5.3 | 3.4 | -2.3 |

*Second data file*

| | column 1 | column 2 | column 3 | column 4 | column 5 |
|---|---|---|---|---|---|
| row 1 | -1.2 | 3.8 | 5.1 | 4.6 | 5.6 |
| row 2 | -3.4 | na | 6.7 | 1.4 | 2.6 |
| row 3 | - | 4.3 | 3.4 | -3.9 | 5 |

The input files must be in **Tab delimited text file format** (saved with a .txt extension).
The files contain the data together with row names and column names. The first column should hold the row names and the first row should hold the column names.
The value in the cell in the first row and first column (left blank in the above example) will be ignored.

Missing values are accepted and should be indicated by 'na' (row 2/column 1 of the first data file above) or a hyphen (row 3/column 3 of the first data file). Empty cells are also accepted.

We use a period (full stop) as a decimal point. Using a comma will result in errors, since we use commas as list separators. For the same reason, please do not use commas to separate digits in large numbers. For example, numbers should be written as "123456.78" not "123456,78" or "123,456.78" or "123,456,78".

With fewer than 3 rows or columns the clustering algorithm cannot provide any useful information.

#### Default values

For convenience, the most common choices are preselected as defaults for the order choice, distance method, cluster algorithm and binning method. By default, both rows and columns are clustered using the Euclidean distance method and the average linkage cluster algorithm, and the binning method is equal width. To change these options, click the advanced options radio button; the options will then appear. The various choices are explained in more detail below.

#### Selecting appropriate options

Selecting one of the order choices determines the order in which the output appears. To group similar data together, choose "Cluster". To have the computer order your data randomly, choose "Randomize". To have your results appear in the order specified in your original file, select "No cluster". You must specify the order for each axis.

If you select "Cluster" as the order choice, you must also select a cluster algorithm and a distance method. Otherwise, skip this section.

The **distance method** quantifies the dissimilarity between two data vectors.

- **Correlation** distance uses 1-ρ as the distance, where ρ is the correlation of the two vectors.
- **Euclidean** distance uses the square root of the sum of squared differences of the coordinate values.
- **Manhattan** distance uses the sum of absolute differences of the coordinate values. Each coordinate is first normalized (to have standard deviation 1).
- **Maximum** distance uses the maximum absolute difference in all coordinates.
- **Absolute correlation** distance uses the absolute value of the correlation between the items.
- **Canberra** distance is similar to the Manhattan distance, except that the absolute differences of the coordinate values are divided by the sum of the absolute values of the coordinates before summing. Note that the Canberra distance cannot be used if any of the values in the input file are negative.
- **Jaccard index** computes the dissimilarity between two sets, i.e. collections of elements. Each row (or column) indicates whether the entity representing each column (or row) belongs or does not belong to the entity representing the row. So, for example, a 1 in row 10, column 20 can imply that gene 20 belongs to category 10, while a 0 implies that gene 20 does not belong to category 10.
- **Minkowski** distance is a generalization of the Euclidean distance in which the differences of the coordinate values are raised to the p-th power and summed, and the p-th root of the sum is then taken.
- **Cosine** distance is computed like the correlation distance, except that the two vectors are not normalized to have zero mean.
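Several of these definitions translate directly into code. The sketch below is illustrative only and is not the tool's implementation: the function names are hypothetical, the vectors are assumed complete, the normalization step described for the Manhattan distance is omitted, and skipping 0/0 terms in the Canberra distance is an assumption.

```python
from math import sqrt

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences (no normalization applied here).
    return sum(abs(a - b) for a, b in zip(x, y))

def maximum(x, y):
    # Largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # p-th root of the sum of p-th powers of absolute differences.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def canberra(x, y):
    # Each absolute difference is divided by |a| + |b| before summing;
    # coordinates where both values are zero are skipped (assumption).
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)

def cosine_distance(x, y):
    # 1 - cosine of the angle between the vectors (no centering).
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

def correlation_distance(x, y):
    # 1 - rho: the cosine distance after centering each vector on its mean.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_distance([a - mx for a in x], [b - my for b in y])

x, y = [1, 2, 3], [4, 0, 3]
d_manhattan = manhattan(x, y)   # |1-4| + |2-0| + |3-3| = 5
d_maximum = maximum(x, y)       # 3
```

With p = 2 the Minkowski distance coincides with the Euclidean distance, and with p = 1 it coincides with the (unnormalized) Manhattan distance.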

The **cluster algorithm** specifies the linkage method used by the hierarchical clustering algorithm to
determine the distance between cluster groups.

- **Average linkage** defines the distance as the average of all pairs from each cluster group.
- **Single linkage** defines the distance as the minimum of all pair-wise distances between elements in the two clusters.
- **Complete linkage** defines the distance as the maximum of all pair-wise distances between elements in the two clusters.
- **Ward** method uses the sum of squared distances from the centroid as a measure of the tightness of the clustering. At each step, the two clusters that are combined together are those that lead to the smallest increase of this measure after combination.
- **McQuitty** method computes the distance between two clusters as the sum of distances of the sub-clusters that make each of the two clusters, weighted by the number of elements in each sub-cluster.
- **Median** method combines clusters whose medians are the closest.
- **Centroid** method combines clusters whose centroids are the closest.
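The first three linkage rules differ only in how they summarize the pair-wise distances between two clusters. A minimal sketch follows, with hypothetical function names and a caller-supplied `dist` function; it computes the distance between two given clusters and is not a full hierarchical clustering routine.

```python
def pairwise(c1, c2, dist):
    # All distances between an element of c1 and an element of c2.
    return [dist(a, b) for a in c1 for b in c2]

def single_linkage(c1, c2, dist):
    # Minimum pair-wise distance between the two clusters.
    return min(pairwise(c1, c2, dist))

def complete_linkage(c1, c2, dist):
    # Maximum pair-wise distance between the two clusters.
    return max(pairwise(c1, c2, dist))

def average_linkage(c1, c2, dist):
    # Mean of all pair-wise distances between the two clusters.
    d = pairwise(c1, c2, dist)
    return sum(d) / len(d)

# One-dimensional points with absolute difference as the distance.
def abs_diff(a, b):
    return abs(a - b)

c1, c2 = [0, 1], [4, 6]
d_single = single_linkage(c1, c2, abs_diff)      # pairs: 4, 6, 3, 5 -> 3
d_complete = complete_linkage(c1, c2, abs_diff)  # 6
d_average = average_linkage(c1, c2, abs_diff)    # 18 / 4 = 4.5
```

Single linkage tends to chain nearby clusters together, while complete linkage favors compact groups; average linkage sits between the two.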

The distance method and cluster algorithm can be chosen separately for each axis.

The **binning method** specifies how data values are mapped to colors for displaying the CIM.

- **Equal width** divides the range of data values (from minimum to maximum) into equal width intervals. Each interval is mapped to one color.
- **Quantile** divides the range of data values into intervals each with approximately the same number of data points. This effectively spreads out the color differences between data values that are present in regions with a large number of values.
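The difference between the two binning methods is easiest to see on data with an outlier. The sketch below is one possible way to compute bin indices and is not the tool's implementation; the function names and the choice of cut points are assumptions.

```python
from bisect import bisect_right

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def quantile_bins(values, n_bins):
    """Assign bins so that each holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points taken at positions k/n_bins through the sorted data.
    cuts = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    return [bisect_right(cuts, v) for v in values]

data = [1, 2, 3, 4, 100, 5, 6, 7]
width_bins = equal_width_bins(data, 4)  # [0, 0, 0, 0, 3, 0, 0, 0]
quant_bins = quantile_bins(data, 4)     # [0, 0, 1, 1, 3, 2, 2, 3]
```

With equal width, the single outlier pushes every other value into the same bin (one color); with quantile binning, each of the four bins receives two values, so the differences among the small values remain visible.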

#### Understanding the result

The result has four frames. The left frame contains a list of the X axis elements, in the order that they appear on the X axis of the image (from left to right). The right frame contains a list of the Y axis elements, in the order that they appear on the Y axis of the image (from top to bottom). These two frames also contain links to display a separate image of the merge height plot. The main frame, in the middle, contains your input file name(s), a link to download a data file containing the raw data used to create the image, the image itself and a row of buttons that allow you to select various ways to update your image. The image is a GIF file.

You may reformat the image by clicking the "Color", "Binning", "Zoom", "Axes", or "Page Layout" button, each of which opens a new section of the page where you can select options for that category.