Clustering

SeqExpress can identify clusters using five different mechanisms: by model, by graph, by hierarchy, by heuristic and by distance measure.

The results of each analysis are added to the Gene Tab as a new item; the clusters can be viewed by selecting the (+) symbol.

SeqExpress also provides tools for refining and validating clusters, calculating projections (using covariance or co-occurrence) and transforming the data.

By Model

Further information about how the models are generated is available. There are three stages in generating models:

Once the models have been generated they will be placed in the model tab. It is then possible to fit additional data sets to the model (e.g. fit a set of disease experiments to a set of normal/control experiments). To do this, select Edit->Models->Fit Data, then choose the model you wish to use and the set of experiments you wish to have fitted. Once the data sets have been fitted to the model and the probabilities for each have been calculated, the results will be placed in the model tab in a tree beneath the original model. Every time a fitting operation is performed, new probabilities are calculated which reflect how similar the data is to each of the models (see example table below).

Gene   Model 1   Model 2
YAL1   0.5       0.5
YAL2   0.3       0.7
YAL3   0.9       0.1
...
YALn   0.6       0.4

Original data used to create the models. The probabilities correspond to how much the gene expression values affected the model parameters. As these are probabilities, they sum to 1 for each gene.

Gene   Model 1   Model 2
YAL1   0.4       0.6
YAL2   0.7       0.3
YAL3   0.4       0.6
...
YALn   0.8       0.2

The probabilities for a second set of data can be generated by fitting the data to the models. The models are unchanged, and the probability reflects how similar the data is to each of the models.

When you have generated the models (and optionally fitted additional experiment data to them), you can generate clusters. Clusters can be generated:

Gene   Model 1->1   Model 1->2   Model 2->1   Model 2->2
YAL1   0.2          0.3          0.2          0.3
YAL2   0.21         0.09         0.49         0.21
YAL3   0.36         0.54         0.04         0.06
...
YALn   0.48         0.12         0.32         0.08

Probabilities describing the behaviour of genes in both data sets. Model 1->1 is the probability of the gene being associated with Model 1 in the first data set and Model 1 in the second data set; Model 1->2 is the probability of the gene being associated with Model 1 in the first data set and Model 2 in the second data set; and so on.
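The example values in the tables above are consistent with multiplying a gene's probability for a model in the first data set by its probability for a model in the second data set. The following Python sketch reproduces that arithmetic. It assumes the two fits can be treated as independent, and that the second and third rows of the fitted table refer to YAL2 and YAL3. The function name `joint_probabilities` and the dictionary layout are illustrative only and are not part of SeqExpress.

```python
from itertools import product


def joint_probabilities(original, fitted):
    """Combine per-model probabilities from two data sets.

    original, fitted: dict mapping gene name -> list of model probabilities.
    Returns a dict mapping gene name -> {(i, j): probability}, where i is the
    model (1-based) in the first data set and j the model in the second.
    Assumption: the two fits are independent, so the probabilities multiply.
    """
    joint = {}
    for gene, p_first in original.items():
        p_second = fitted[gene]
        joint[gene] = {
            (i + 1, j + 1): p_first[i] * p_second[j]
            for i, j in product(range(len(p_first)), range(len(p_second)))
        }
    return joint


# Values taken from the example tables.
original = {"YAL1": [0.5, 0.5], "YAL2": [0.3, 0.7], "YAL3": [0.9, 0.1]}
fitted = {"YAL1": [0.4, 0.6], "YAL2": [0.7, 0.3], "YAL3": [0.4, 0.6]}

joint = joint_probabilities(original, fitted)
for gene, probs in joint.items():
    # The most probable (first model, second model) pair suggests a cluster.
    best = max(probs, key=probs.get)
    print(gene, {k: round(v, 2) for k, v in probs.items()}, "most likely:", best)
```

A gene whose most probable pair has two different model numbers (for example Model 1->2) is a candidate for having changed behaviour between the two data sets.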

The purpose of these clusters is to separate genes whose behaviour has altered in some significant way (e.g. in the control experiments the gene behaves like a gene that was active in G1 phase, and in the disease tissue experiments it behaves like a housekeeping gene) from those whose behaviour has remained the same. You can visualise both the fitted and original data (using the View->View Model menu, and selecting both the original model and fitted data), and study how genes have migrated between behaviours. We generally see the following types of behaviour:

It is worth noting the crudeness of this approach: we are simply using the expression profiles to define the models and are not using other aspects of biological knowledge. The approach can be extended to bring in such knowledge. For example, we can combine our knowledge of gene location and gene expression to produce clusters of genes which have a high probability of being co-expressed.

The models and the data fitted to them can be viewed using the probability viewer.

 

By Graph

Further information about how the graphs are generated is available. There are four stages in generating a graph:

Once the graph has been generated it will be placed in the graph tab. From the graph tab popup menu or the Edit->Graph->Generate Clusters menu you can generate partitions (and thus clusters) based on the generated graph. When generating the clusters you need to choose both the partitioning method (regular or irregular sized clusters) and the approximate size of the clusters (if you choose irregular sized partitions, the size is the minimum size of a partition). Regular sized partitions are found using Metis; irregular sized partitions are found using an inbuilt max/min algorithm.

Once the clusters have been found they will be placed in the gene tab.

By Hierarchy

Further information about how the hierarchies are generated is available. To generate a hierarchy you need to choose the method you wish to use from the Edit->Hierarchies menu. The choice is between:

Once the hierarchy has been generated it will be placed in the graph tab (in the hierarchy table). From the graph tab popup menu or the Edit->Hierarchy->Generate Clusters menu you can generate clusters. You need to specify the desired cluster size; the clusters are generated by descending the tree until the cluster size is less than or equal to the required size (the search is cut when clusters of size 1 are encountered).
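The descent described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SeqExpress implementation: the tree is assumed to be binary and is represented as nested tuples with gene names as leaves, and the names `leaves` and `cut_hierarchy` are hypothetical.

```python
def leaves(node):
    """Return the gene names (leaf labels) beneath a node."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def cut_hierarchy(node, max_size):
    """Descend the tree until every subtree holds at most max_size genes."""
    members = leaves(node)
    # Stop at a single gene (cluster of size 1) or when the subtree is
    # already small enough to be reported as one cluster.
    if isinstance(node, str) or len(members) <= max_size:
        return [members]
    left, right = node
    return cut_hierarchy(left, max_size) + cut_hierarchy(right, max_size)


# A small hypothetical hierarchy of eight genes.
tree = ((("g1", "g2"), "g3"), (("g4", "g5"), ("g6", ("g7", "g8"))))
clusters = cut_hierarchy(tree, 3)
for cluster in clusters:
    print(cluster)
```

Because the cut follows the shape of the tree, the resulting clusters can be smaller than the requested size; the size acts as an upper bound rather than a target.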

The hierarchies themselves can be visualised using the hierarchy viewer.

Once the clusters have been found they will be placed in the gene tab.

By Heuristic

Two heuristic based approaches are available within SeqExpress:

SAGE clusters

A modified version of (Gaussian) expectation maximization is used to find sub-distributions within the data loaded into SeqExpress. This method is designed to work with SAGE data (preferably globally ranked).

The system attempts to find the optimal number of sub distributions through a number of different iterations. It detects an optimal number of starting distributions, and then once convergence has been reached the process is halted. Genes are then mapped to the distributions they 'fit best', and the resulting sets of genes are added to the selection of genes within SeqExpress.
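For readers unfamiliar with expectation maximization, the sketch below shows the textbook algorithm for a one-dimensional Gaussian mixture in plain Python. It is not the modified version used by SeqExpress: the number of sub-distributions is fixed by the caller rather than detected, and the function names, starting means and data values are all illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gaussian_mixture(values, start_means, max_iter=200, tol=1e-8):
    """Standard EM for a 1-D Gaussian mixture with len(start_means) components."""
    k = len(start_means)
    means = list(start_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    prev_ll = None
    for _ in range(max_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        log_likelihood = 0.0
        for x in values:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            log_likelihood += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
            weights[j] = nj / len(values)
        # Halt once the log-likelihood stops improving (convergence).
        if prev_ll is not None and abs(log_likelihood - prev_ll) < tol:
            break
        prev_ll = log_likelihood
    # Map each value to the distribution it fits best.
    assignments = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, variances, weights, assignments


# Hypothetical expression values forming two groups.
values = [0.8, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.2]
means, variances, weights, assignments = em_gaussian_mixture(values, [0.0, 6.0])
print([round(m, 3) for m in means], assignments)
```

The final step mirrors the mapping of genes to the distributions they fit best: each value is assigned to the component with the highest responsibility.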

This EM implementation has four different modes:

Location and expression combined probability

The graph model is a useful tool for combining information about two facets of gene behaviour (e.g. the gene function and its expression profile); however, it is limited to only two items of information. When studying gene behaviour we would often wish to go beyond two such factors (e.g. combining information about gene location, codon usage, proposed gene function and a number of sets of expression experiments). One mechanism that supports this is probabilities (and networks of probabilities). Such a mechanism has been used to build an example analysis method, which defines both the distance between genes on a chromosome and gene expression similarity as probabilities. These probabilities are then combined to identify sets of genes with similar behaviour.
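The idea of expressing both factors as probabilities and combining them can be illustrated with a small Python sketch. The formulas here are assumptions chosen for illustration and are not taken from SeqExpress: chromosomal distance is mapped to a probability with an exponential decay (the `scale` parameter is hypothetical), expression similarity is a rescaled Pearson correlation, and the two are combined by multiplication as if they were independent.

```python
import math


def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)


def location_probability(pos_a, pos_b, scale=10_000):
    """Assumed mapping: 1.0 for genes at the same position, decaying with distance."""
    return math.exp(-abs(pos_a - pos_b) / scale)


def expression_probability(profile_a, profile_b):
    """Assumed mapping: rescale the correlation from [-1, 1] to [0, 1]."""
    return (1 + pearson(profile_a, profile_b)) / 2


def combined_probability(gene_a, gene_b, scale=10_000):
    """Combine the two probabilities, treating them as independent (assumption)."""
    return (location_probability(gene_a["pos"], gene_b["pos"], scale)
            * expression_probability(gene_a["expr"], gene_b["expr"]))


# Hypothetical genes: chromosome position in base pairs plus an expression profile.
gene_a = {"pos": 1_000, "expr": [1.0, 2.0, 3.0, 4.0]}
gene_b = {"pos": 1_000, "expr": [2.0, 4.0, 6.0, 8.0]}
gene_c = {"pos": 90_000, "expr": [4.0, 3.0, 2.0, 1.0]}

print(combined_probability(gene_a, gene_b))
print(combined_probability(gene_a, gene_c))
```

Further factors, such as codon usage, could in principle be added as extra terms of the same kind, which is the advantage of a probabilistic formulation over the two-factor graph model.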

You can select this method by using the Tools->Clusters->By Heuristic->Use location and expression profiles...

This method is still a prototype.

 

By Distance Measure

Depending on the experiment type, different distance methods should be used: for SAGE data, a similarity (Euclidean distance) search which is both anchored and seeded with 'real values' is generally suitable; for gene chip data, a slope (gradient) search which is not seeded from real values can be used. Overall, SeqExpress provides seven mechanisms for clustering genes into non-overlapping sets.

More information about distance measures is available.

For some of the above mechanisms a number of different customizable options are available including:

  1. The number of starting clusters gives an indication of the cluster size; during iterations this number may be decreased if the size of a cluster falls below a user-defined size.
  2. The centroid for the cluster can be anchored to the most similar gene within the cluster.
  3. The starting clusters can either be chosen from the set of real values or randomly generated.

Anchoring keeps the definition of a cluster close to real expression profiles, and helps alleviate problems associated with outliers. The clustering is started by selecting Tools->Find Clusters->Using distance measures. The following options are available to control both the starting clusters and the cluster refinement iterations:

The refinement process is repeated until convergence has been reached. The results are then added to the gene list on the main page.
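The seeding, anchoring and refinement steps described in this section can be sketched as a k-means style loop in Python. This is a minimal illustration using Euclidean distance, not the SeqExpress code: the function `cluster_profiles`, its parameters and the example profiles are hypothetical, and the reduction of the cluster count for undersized clusters is omitted.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_profile(profiles):
    """Column-wise mean of a list of profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def cluster_profiles(profiles, k, seeds=None, anchored=True, max_iter=100, rng=None):
    """Assign each gene to one of k non-overlapping clusters.

    seeds: gene names whose real profiles start the clusters; if None,
    the starting centroids are randomly generated instead.
    """
    rng = rng or random.Random(0)
    names = list(profiles)
    if seeds is not None:
        centroids = [list(profiles[g]) for g in seeds]
        k = len(centroids)
    else:
        dims = len(profiles[names[0]])
        lo = min(min(p) for p in profiles.values())
        hi = max(max(p) for p in profiles.values())
        centroids = [[rng.uniform(lo, hi) for _ in range(dims)] for _ in range(k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = {
            g: min(range(k), key=lambda c: euclidean(profiles[g], centroids[c]))
            for g in names
        }
        if new_assignment == assignment:
            break  # convergence: no gene changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [profiles[g] for g in names if assignment[g] == c]
            if not members:
                continue  # keep the old centroid for an empty cluster
            centre = mean_profile(members)
            if anchored:
                # Anchor the centroid to the real profile closest to the mean.
                centre = list(min(members, key=lambda p: euclidean(p, centre)))
            centroids[c] = centre
    return assignment


profiles = {
    "g1": [1.0, 1.0, 1.0],
    "g2": [1.1, 0.9, 1.0],
    "g3": [5.0, 5.0, 5.0],
    "g4": [5.2, 4.9, 5.1],
}
result = cluster_profiles(profiles, k=2, seeds=["g1", "g3"])
print(result)
```

Seeding from real profiles and anchoring both keep the centroids on observed data, which is why the text recommends them when outliers are a concern.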