Refining clusters

Once clusters have been identified, they can be refined in several ways.

Selecting the Refine Clusters->By Ontology Terms option examines the clusters to see whether any have a significant level of one or more ontology terms. If the information is available, it is also possible to select groups by their genome location or their protein-protein interactions. Extracting the ontology terms and calculating the scores takes a few minutes; the progress of the ontology scoring is shown in the bottom progress bar of the main SeqExpress window (or through the File->Monitor Dialog). It is possible to select which sets of clusters are to be refined. Even if more than one set is selected, only one set of refined clusters is generated. If the 'only report clusters that have high ontology term enrichment' box is selected, only clusters that have a significant concentration of genes with a specific ontology term are returned. The refined cluster is added to the Gene List in the bottom right of the main SeqExpress window.
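The text above does not say which statistic is used to decide that a term is "significant". As an illustration only, one common way to score term enrichment is a hypergeometric test; the function name and parameters below are hypothetical and are not taken from SeqExpress.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, cluster_annotated):
    """Probability of seeing at least `cluster_annotated` term-carrying genes
    in a cluster drawn at random (hypergeometric null; an assumed scoring
    method, not necessarily the one SeqExpress uses).

    population        -- total number of genes in the data set
    annotated         -- genes in the data set that carry the ontology term
    cluster_size      -- number of genes in the cluster
    cluster_annotated -- genes in the cluster that carry the term
    """
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(cluster_annotated, upper + 1)
    )
    return tail / total


# Made-up numbers: 50 of 1000 genes carry a term, and 8 of them fall
# in a cluster of 20 genes (only 1 would be expected by chance).
p = enrichment_p_value(1000, 50, 20, 8)
```

A small p-value suggests that the term is concentrated in the cluster more than chance alone would explain.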

Selecting the Tools->Refine Clusters->By Distribution option identifies sub-distributions within the data sets. A set of clusters is used as the starting point for the sub-distribution identification. These sub-distributions are described using models, and a number of different models are available. The fit of the data to the models can also be altered using an energy parameter. The models are fitted using Expectation Maximisation. A number of options are available for controlling both the calculation of the models and the generation of clusters from these models. When calculating the models it is possible to define:

Energy Modifier: this raises each residual to the given power, which changes how strongly each gene influences the models. Raising the value makes the models more closely describe those genes that have lower residuals (e.g. if gene 1 has a residual of 5 and gene 2 has a residual of 10, with an energy modifier of 2 these will become 25 and 100); lowering it makes the models attempt to describe more genes that have higher residuals.

Tempering: this automatically changes the energy parameter to try and find the 'best' value (see below)

Merging Models: this will remove models that have low probabilities and see if this improves the overall predictions of the other models.
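The effect of the energy modifier on the example residuals can be sketched as follows. This assumes, based on the worked example above (5 and 10 becoming 25 and 100), that the modifier is applied as an exponent; the function name is hypothetical.

```python
def apply_energy_modifier(residuals, energy):
    """Raise each residual to the power `energy` (assumed interpretation)."""
    return [r ** energy for r in residuals]


residuals = [5.0, 10.0]

raised = apply_energy_modifier(residuals, 2)     # [25.0, 100.0]
lowered = apply_energy_modifier(residuals, 0.5)  # square roots of 5 and 10

# Ratio between the two genes' residuals before and after modification.
gap_before = residuals[1] / residuals[0]  # 2.0
gap_raised = raised[1] / raised[0]        # 4.0
gap_lowered = lowered[1] / lowered[0]     # about 1.41
```

A modifier above 1 widens the gap between well-fitted and poorly-fitted genes, while a modifier below 1 narrows it, which is consistent with the behaviour described for raising and lowering the value.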

When generating clusters two options are available:

Partition clusters: genes are assigned to the cluster/model that has the highest probability of describing them.

Use cut off: genes are assigned to a cluster/model if the probability of it describing them is greater than the given cut off.
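The difference between the two options can be illustrated with a small sketch. The probabilities, gene names and function names below are invented for the example and do not come from SeqExpress.

```python
def partition_clusters(probabilities):
    """Assign each gene to the single model with the highest probability."""
    clusters = {}
    for gene, probs in probabilities.items():
        best = max(range(len(probs)), key=probs.__getitem__)
        clusters.setdefault(best, []).append(gene)
    return clusters


def cut_off_clusters(probabilities, cut_off):
    """Assign each gene to every model whose probability exceeds `cut_off`."""
    clusters = {}
    for gene, probs in probabilities.items():
        for model, p in enumerate(probs):
            if p > cut_off:
                clusters.setdefault(model, []).append(gene)
    return clusters


# Hypothetical probability of each gene being described by model 0 and model 1.
probabilities = {
    "geneA": [0.90, 0.10],
    "geneB": [0.55, 0.45],
    "geneC": [0.20, 0.80],
}

partitioned = partition_clusters(probabilities)
# {0: ['geneA', 'geneB'], 1: ['geneC']}

loose = cut_off_clusters(probabilities, 0.4)
# {0: ['geneA', 'geneB'], 1: ['geneB', 'geneC']}

strict = cut_off_clusters(probabilities, 0.6)
# {0: ['geneA'], 1: ['geneC']}
```

Under this reading, partitioning places every gene in exactly one cluster, whereas a cut off can place a gene in several clusters (geneB with a cut off of 0.4) or in none (geneB with a cut off of 0.6).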

Validating clusters

A C-Index for the clusters can be generated by selecting Tools->Validate Clusters->Calculate C-Index. The C-Index is a measure of how well the closest items have been clustered. It is defined as (Sum-Min)/(Max-Min), where Sum is the sum of all the distances between items within a cluster (for N items there will be N*(N-1)/2 distances), Min is the sum of the same number of the smallest distances between all items, and Max is the sum of the same number of the largest distances between all items. For the best clustering Sum approaches Min, and so the C-Index will be small.
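The calculation can be sketched in a few lines. This is a minimal illustration of the (Sum-Min)/(Max-Min) formula using a Manhattan distance and made-up profiles, not the SeqExpress implementation; it counts the within-cluster pairs across all clusters and takes that many smallest and largest distances for Min and Max.

```python
from itertools import combinations


def manhattan(a, b):
    """Manhattan (city-block) distance between two profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def c_index(profiles, labels, distance=manhattan):
    """C-Index = (Sum - Min) / (Max - Min); smaller values are better."""
    pairs = list(combinations(range(len(profiles)), 2))
    distances = [distance(profiles[i], profiles[j]) for i, j in pairs]

    # Distances between items that share a cluster label.
    within = [d for d, (i, j) in zip(distances, pairs) if labels[i] == labels[j]]
    count = len(within)

    ordered = sorted(distances)
    total = sum(within)
    smallest = sum(ordered[:count])
    largest = sum(ordered[len(ordered) - count:])

    if largest == smallest:
        return 0.0  # degenerate case: no spread in the distances
    return (total - smallest) / (largest - smallest)


# Two tight groups of made-up profiles.
profiles = [[0, 0], [0, 1], [10, 10], [10, 11]]

good = c_index(profiles, [0, 0, 1, 1])  # neighbours clustered together
bad = c_index(profiles, [0, 1, 0, 1])   # neighbours split apart
```

Here the first labelling groups the closest items and gives a C-Index of 0, while the second separates them and gives a value close to 1.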

A variety of distance measures are available, and so it is possible to find out:

how well clustering algorithms have worked and how altering parameters affects the clustering (thereby making it possible to find the 'best' parameters)
the similarity between the distance measures (clustering with one distance measure and validating with another). For example, we can cluster using a Manhattan distance and then validate using a gene-ontology distance measure (to discover to what extent genes with similar terms have similar expression profiles)

It is important to note that a low C-Index does not mean that a cluster is biologically meaningful, and that, depending on the distance measure, validation can require differing resources (for 10,000 genes, 500 MByte of memory may be needed).
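A rough back-of-envelope estimate shows where a figure of this order can come from. It assumes that one 8-byte floating point value is held for every pair of genes; how SeqExpress actually stores distances is not stated here, so the function and its default are illustrative only.

```python
def pairwise_distance_memory(n_items, bytes_per_distance=8):
    """Bytes needed to hold one distance for every unordered pair of items."""
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs * bytes_per_distance


estimate = pairwise_distance_memory(10_000)
print(f"{estimate / 1e6:.0f} MB")  # prints "400 MB"
```

The estimate grows with the square of the number of genes, so doubling the data set roughly quadruples the memory needed for the distances alone.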

The results of the C-Index validation can be viewed using the Cluster Analysis tool.