Refining clusters

Once clusters have been identified, they can be refined by:

Selecting the Refine Clusters->By Ontology Terms option examines the clusters to see whether any contain a significant level of one or more ontology terms. If the information is available, it is also possible to select groups by their genome location or their protein-protein interactions. Extracting the ontology terms and calculating the scores takes a few minutes; progress can be followed in the progress bar at the bottom of the main SeqExpress window (or through the File->Monitor Dialog). It is possible to select which sets of clusters are to be refined. If more than one set is selected, only one set of refined clusters is still generated. If the "only report clusters that have high ontology term enrichment" box is ticked, only clusters with a significant concentration of genes sharing a specific ontology term are returned. The refined cluster is added to the Gene List in the bottom right of the main SeqExpress window.
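This page does not say how SeqExpress scores ontology term enrichment. As a rough illustration of what a "significant concentration" of a term can mean, the sketch below uses a one-sided hypergeometric test, which is a common choice for this kind of question but is an assumption here, not the documented SeqExpress method. The function name and its parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(cluster_size, hits_in_cluster,
                       population_size, hits_in_population):
    """Probability of seeing at least `hits_in_cluster` annotated genes
    when `cluster_size` genes are drawn at random, without replacement,
    from a population in which `hits_in_population` genes carry the term.

    Assumption: a hypergeometric model; SeqExpress may score differently.
    """
    total = comb(population_size, cluster_size)
    upper = min(cluster_size, hits_in_population)
    tail = sum(
        comb(hits_in_population, k)
        * comb(population_size - hits_in_population, cluster_size - k)
        for k in range(hits_in_cluster, upper + 1)
    )
    return tail / total


# A cluster of 3 genes, all 3 annotated, from 10 genes of which 3 are annotated.
p = enrichment_p_value(3, 3, 10, 3)
```

A small p-value suggests the term is over-represented in the cluster relative to chance. In practice many terms are tested at once, so some correction for multiple testing would normally be applied.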

Selecting the Tools->Refine Clusters->By Distribution option identifies sub-distributions within the data sets. A set of clusters is used as the starting point for the sub-distribution identification. These sub-distributions are described using models, and a number of different models are available. The fitting of the data to the models can also be altered using an energy parameter. The models are fitted using Expectation Maximisation. A number of options are available for controlling both the calculation of the models and the generation of clusters from these models. When calculating the models it is possible to define:

When generating clusters two options are available:
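To give a feel for the Expectation Maximisation fitting mentioned above, here is a minimal sketch that fits two one-dimensional Gaussian components to a list of values. The choice of Gaussian models, the fixed number of components, the starting values and the function name are all assumptions made for illustration. The models SeqExpress offers and its energy parameter are not reproduced here.

```python
import math


def fit_two_gaussians(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture by Expectation Maximisation.

    Illustrative only: this is not the SeqExpress implementation.
    """
    # Deterministic start: place the components at the extremes of the data.
    means = [min(values), max(values)]
    spread = (max(values) - min(values)) / 4 or 1.0
    variances = [spread ** 2, spread ** 2]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [
                w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])

        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            rk = sum(r[k] for r in resp)
            if rk < 1e-12:
                continue  # component has no support; leave it unchanged
            weights[k] = rk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / rk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / rk
            variances[k] = max(var, 1e-6)  # floor stops a component collapsing

    return weights, means, variances


weights, means, variances = fit_two_gaussians(
    [0.9, 1.0, 1.1, 1.0, 4.9, 5.0, 5.1, 5.0]
)
```

Once the components are fitted, each item can be placed in the cluster of the component with the highest responsibility for it, which is one simple way of generating clusters from the models.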

Validating clusters

A C-Index for the clusters can be generated by selecting Tools->Validate Clusters->Calculate C-Index. The C-Index is a measure of how well the closest items have been clustered. The C-Index is (Sum-Min)/(Max-Min), where Sum is the sum of all the distances between items within a cluster (for N items there will be N*(N-1)/2 distances), Min is the sum of the same number of the smallest distances between all items, and Max is the sum of the same number of the largest distances between all items. For the best clustering Sum approaches Min, and so the C-Index will be small.
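The calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the SeqExpress code: within-cluster pairs are pooled across all clusters into a single Sum, the distance function is passed in by the caller, and the name c_index is hypothetical.

```python
from itertools import combinations


def c_index(items, labels, distance):
    """Return (Sum - Min) / (Max - Min) for a labelled set of items.

    Assumption: within-cluster pairs from every cluster are pooled.
    """
    pairs = list(combinations(range(len(items)), 2))
    all_distances = sorted(distance(items[i], items[j]) for i, j in pairs)
    within = [
        distance(items[i], items[j])
        for i, j in pairs
        if labels[i] == labels[j]
    ]

    n = len(within)
    if n == 0 or n == len(all_distances):
        raise ValueError("need at least two clusters and one within-cluster pair")

    total = sum(within)                 # Sum: within-cluster distances
    smallest = sum(all_distances[:n])   # Min: the n smallest distances overall
    largest = sum(all_distances[-n:])   # Max: the n largest distances overall
    if largest == smallest:
        raise ValueError("all distances are equal; the C-Index is undefined")
    return (total - smallest) / (largest - smallest)


def absolute_difference(a, b):
    return abs(a - b)


values = [0, 1, 10, 11]
good = c_index(values, [0, 0, 1, 1], absolute_difference)  # close items together
bad = c_index(values, [0, 1, 0, 1], absolute_difference)   # close items split up
```

With these four values the first labelling groups the nearest items and gives a C-Index of 0, while the second separates them and gives 18/19, close to the maximum of 1.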

A variety of distance measures are available, and so it is possible to find out:

It is important to note that a low C-Index does not mean a cluster is biologically meaningful, and that, depending on the distance measure, validation can require differing resources (for 10,000 genes, 500 MB of memory may be needed).
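The memory figure can be put in context with a back-of-envelope estimate. The sketch below assumes that every pairwise distance is held in memory as an 8-byte floating point number. How SeqExpress actually stores distances is not stated here, so the function and its default are hypothetical.

```python
def pairwise_distance_bytes(n_items, bytes_per_distance=8):
    """Bytes needed to hold one distance for every unordered pair of items."""
    pairs = n_items * (n_items - 1) // 2
    return pairs * bytes_per_distance


needed = pairwise_distance_bytes(10_000)
print(needed)  # 399960000
```

For 10,000 genes this gives 49,995,000 pairs and roughly 400 MB, the same order of magnitude as the figure quoted above. Because the number of pairs grows with the square of the number of genes, doubling the gene count roughly quadruples the requirement.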

The results of the C-Index validation can be viewed using the Cluster Analysis tool.