
By Model: Models are built and genes with a high likelihood of belonging to a specific model are clustered.
By Graph: Graphs are built that represent genome location, functional similarity or expression profile similarity; these are then partitioned.
By Hierarchy: Trees representing the hierarchical relationships between the genes can be constructed and then partitioned.
By Heuristics: Specialised workflows have been designed to identify difficult-to-find clusters (e.g. in SAGE data, or genes with similar genome locations and expression profiles).
By Distance Measure: Clusters are derived using inter-gene distance.

The results of each analysis are added to the Gene Tab as a new item; the clusters can be viewed by selecting the (+) symbol (see opposite).

By Model

Choose how the initial models will be generated: either from a pre-defined set of clusters (by selecting the Edit->Models->Refine Clusters... menu) or from scratch (by selecting the Edit->Models->Generate Clusters... menu).
Choose the number of models; this defaults to twice the number of parameters/experiments. Once the models have been generated, looking at the probabilities will give an idea of whether there were too few (or too many) models.
Choose the type of model: either Euclidean-based models (designed for experiments with similar uniform distributions) or Normal-based distribution models (suitable for experiments which do not have similar uniform distributions).

Once the models have been generated they will be placed in the model tab. It is then possible to fit additional data sets to the model (e.g. fit a set of disease experiments to a set of normal/control experiments). To do this, select the Edit->Models->Fit Data menu, then choose the model you wish to use and the set of experiments you wish to fit. Once the data sets have been fitted to the model and the probabilities for each have been calculated, the results will be placed in the model tab in a tree beneath the original model. Every time a fitting operation is performed, new probabilities are calculated which reflect how similar the data is to each of the models (see example table below).

Gene	Model 1	Model 2
YAL1	0.5	0.5
YAL2	0.3	0.7
YAL3	0.9	0.1
...
YALn	0.6	0.4

Gene	Model 1	Model 2
YAL1	0.4	0.6
YAL2	0.7	0.3
YAL3	0.4	0.6
...
YALn	0.8	0.2

Original data used to create the models. The probabilities correspond to how much the gene expression values affected the model parameters. As these are probabilities, they sum to 1.

The probabilities for a second set of data can be generated by fitting the data to the models. The models are unchanged, and the probability reflects how similar the data is to each of the models.

When you have generated the models (and optionally fitted additional experiment data to them), you can generate clusters. Clusters can be generated:

For a model and original data set. One cluster is generated for each model, and the highest probability for each gene is used to decide which cluster it belongs to.
For a model and fitted data. As all model data is defined as probabilities, we can use these probabilities to describe the behaviour of the genes in both data sets (see table below). We can then cluster genes by their likelihood of having a certain behaviour. As we wish to find genes that have a preference for a specific behaviour, we assign a gene to a cluster (which represents all genes with that behaviour) only where its highest probability is a certain factor greater than its next highest probability. For example, in the table below YAL1 and YALn would not be assigned to any cluster, YAL2 would be assigned to the cluster corresponding to Model 2->1 (that is, the cluster of all genes whose expression profile is best described by Model 2 for the control experiments and by Model 1 for the fitted/disease experiments), and YAL3 would be assigned to the Model 1->2 cluster.

Gene	Model 1->1	Model 1->2	Model 2->1	Model 2->2
YAL1	0.2	0.3	0.2	0.3
YAL2	0.21	0.09	0.49	0.21
YAL3	0.36	0.54	0.04	0.06
...
YALn	0.48	0.12	0.32	0.08

Probabilities describing the behaviour of genes in both data sets. Model 1->1 is the probability of the gene being associated with Model 1 in the first data set and Model 1 in the second data set; Model 1->2 is the probability of the gene being associated with Model 1 in the first data set and Model 2 in the second data set; and so on.
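The two cluster generation rules above can be sketched in a few lines of Python. This is an illustration only, not SeqExpress code: the function names are hypothetical, the product rule for combining the two sets of probabilities is an assumption (it is consistent with the sample values in the tables), and the factor of 2.0 is an arbitrary choice, as the value SeqExpress uses is not stated here.

```python
from itertools import product

def joint_probabilities(original, fitted):
    """Combine per-model probabilities from two data sets.

    Assumption: the two sets are treated as independent, so
    P(Model i -> Model j) = P_original(i) * P_fitted(j).
    """
    return {
        (i, j): original[i] * fitted[j]
        for i, j in product(range(len(original)), range(len(fitted)))
    }

def assign_cluster(joint, factor=2.0):
    """Return the (i, j) behaviour whose probability is at least `factor`
    times the next highest, or None if no behaviour is preferred that strongly.
    The default factor is a hypothetical value chosen for this sketch."""
    ranked = sorted(joint.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (_, p_next) = ranked[0], ranked[1]
    return best if p_best >= factor * p_next else None

# Original and fitted probabilities for two sample genes
# (index 0 = Model 1, index 1 = Model 2).
yal1 = joint_probabilities([0.5, 0.5], [0.4, 0.6])
yal2 = joint_probabilities([0.3, 0.7], [0.7, 0.3])

print(assign_cluster(yal1))  # None: no behaviour stands out
print(assign_cluster(yal2))  # (1, 0), i.e. Model 2->1
```

Setting the factor to 1.0 reduces the rule to the simpler case of assigning each gene to its highest-probability cluster.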

The purpose of assigning genes to these clusters is to show which genes have altered their behaviour in some significant way (e.g. in the control experiments the gene behaves like a gene that was active in G1 phase, and in the disease tissue experiments it behaves like a housekeeping gene), as well as which genes have kept the same behaviour. You can visualise both the fitted and original data (using the View->View Model menu, and selecting both the original model and fitted data), and study how genes have migrated between behaviours. We generally see the following types of behaviour:

Large numbers of genes tend to stay associated with the same model: those that have a high probability of belonging to a model will generally be associated with the same model when new data is fitted (and so have similar expression profiles).
Some models do not reflect controlled expression. Some models simply do not reflect any biological phenomena associated with gene expression; in this case the genes contained within a model for the original data set will simply be randomly associated with different models for the fitted data set.
Migration between models generally occurs en masse (the majority of the genes that were defined by a specific model will now be defined by another model). This represents similar changes in the expression profiles of the set of genes, and is of great interest when attempting to ascertain differences between the original and diseased data sets.
Migration can be split between two models (half of the genes migrate to one model, the other half either remain within the same model or migrate to a third). Again this represents similar changes in expression profiles, but one of the models has encapsulated two very different types of behaviour (which probably were not noticeable in the original data set).

It is worth noting the crudeness of this approach: we are simply using the expression profiles to define the models and are not using other aspects of biological knowledge. The approach can be extended to bring in such knowledge. For example, we can combine our knowledge of gene location and gene expression to produce clusters of genes which have a high probability of being co-expressed.

The models and the data fitted to them can be viewed using the probability viewer.

By Graph

Choose the type of graph to construct by selecting the corresponding Edit->Graph option (minimal spanning tree, ontology graph or location graph).
Choose the gene expression experiments you wish to use (depending on your later options this data may not be used in the graph construction). By default all the experiments are chosen.
Choose the distance measure that will be used to calculate the length of the edges between the graph nodes.

Once the graph has been generated it will be placed in the graph tab. From the graph tab popup menu or the Edit->Graph->Generate Clusters menu you can generate partitions (and thus clusters) based on the generated graph. When generating the clusters you need to choose both the partitioning method (irregular or regular sized clusters) and the approximate size of the clusters (if you choose irregular sized partitions then the size is the minimum size of a partition). Regular sized partitions are found using Metis; irregular sized partitions are found using an inbuilt max/min algorithm.
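As an illustration of the graph construction step, the sketch below builds a minimal spanning tree over a handful of genes, using the Euclidean distance between expression profiles as the edge length. It is not the SeqExpress implementation: Prim's algorithm, the function names and the sample profiles are all assumptions made for the example, and the partitioning step (Metis or the inbuilt max/min algorithm) is not shown.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minimal_spanning_tree(profiles, distance=euclidean):
    """Prim's algorithm over the complete graph of genes.

    profiles: dict mapping gene name -> list of expression values.
    Returns a list of (gene_a, gene_b, edge_length) tuples.
    """
    genes = list(profiles)
    if not genes:
        return []
    in_tree = {genes[0]}
    edges = []
    while len(in_tree) < len(genes):
        # Pick the shortest edge joining the tree to a gene outside it.
        a, b, d = min(
            ((u, v, distance(profiles[u], profiles[v]))
             for u in in_tree for v in genes if v not in in_tree),
            key=lambda edge: edge[2],
        )
        in_tree.add(b)
        edges.append((a, b, d))
    return edges

# Hypothetical expression profiles (three experiments per gene).
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 3.1],
    "geneC": [9.0, 8.0, 7.0],
    "geneD": [9.2, 8.1, 7.0],
}
tree = minimal_spanning_tree(profiles)
for a, b, length in tree:
    print(a, b, round(length, 3))
```

Any other distance measure can be passed in through the `distance` parameter, which mirrors the choice of distance measure made when the graph is constructed.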

By Hierarchy

Further information about how the hierarchies are generated is available. To generate a hierarchy, choose the method you wish to use from the Edit->Hierarchies menu. The choice is between:

Semi Discrete Decomposition: if you choose this method you will have to select the number of factors you wish to find (the default value is generally fine; raising it will generate a more complex/deep tree). With SDD, each node in the tree has three children.
Hierarchical analysis: if you choose this method you will have to select the distance measure that will be used (the distance between groups of genes is based on their average distance, not on the max or min distance). With hierarchical analysis, each node in the tree has two children.

Once the hierarchy has been generated it will be placed in the graph tab (in the hierarchy table). From the graph tab popup menu or the Edit->Hierarchy->Generate Clusters menu you can generate clusters. You need to specify the desired cluster size. The clusters are generated by descending the tree until the cluster size is less than or equal to the required size (the search is cut when clusters of size 1 are encountered).
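The tree descent described above can be sketched as follows. The representation is an assumption made for the example: a leaf is a gene name, an internal node is a list of children (two for hierarchical analysis, three for SDD), and the function names are hypothetical.

```python
def leaves(node):
    """Return all genes under a node.

    A leaf is a gene name (str); an internal node is a list of child nodes.
    """
    if isinstance(node, str):
        return [node]
    return [gene for child in node for gene in leaves(child)]

def cut_tree(node, max_size):
    """Descend from the root, emitting a cluster as soon as a subtree holds
    at most max_size genes. A single gene always ends the descent."""
    members = leaves(node)
    if len(members) <= max_size or isinstance(node, str):
        return [members]
    clusters = []
    for child in node:
        clusters.extend(cut_tree(child, max_size))
    return clusters

# A small hypothetical binary hierarchy over eight genes.
tree = [[["g1", "g2"], "g3"], [["g4", "g5"], ["g6", ["g7", "g8"]]]]
print(cut_tree(tree, 3))
# [['g1', 'g2', 'g3'], ['g4', 'g5'], ['g6', 'g7', 'g8']]
```

Because the descent stops at the first subtree that is small enough, the resulting clusters can be noticeably smaller than the requested size, as with the two-gene cluster here.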

By Heuristic

A SAGE clustering method, which is suitable for finding clusters in SAGE experiments.
A location and expression probability model, which finds clusters of co-expressed genes by combining probabilities reflecting gene position in a chromosome with probabilities representing gene expression similarity.

SAGE clusters

A modified version of (Gaussian/normal based) expectation maximization is used to find sub distributions within the data loaded into SeqExpress. This method is designed to work with SAGE data (preferably globally ranked).

The system attempts to find the optimal number of sub distributions through a number of different iterations. It detects an optimal number of starting distributions, and then once convergence has been reached the process is halted. Genes are then mapped to the distributions they 'fit best', and the resulting sets of genes are added to the selection of genes within SeqExpress.

No filters - works through cycles until convergence is reached.
Large Distribution Removal - attempts to remove distributions with large standard deviations and merge similar distributions.
Bump detection - ensures that bumps are not removed during EM iterations.
All of the above - both large distributions are filtered and bumps are restored.
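For readers unfamiliar with expectation maximization, the sketch below shows the plain, unmodified algorithm on one-dimensional data: it fits a fixed number of normal distributions and maps each value to the distribution it fits best. It is an assumption-laden illustration, not the modified method used by SeqExpress: the number of distributions is fixed by the caller, none of the filters listed above are applied, and all names and sample values are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(values, k=2, max_iter=200, tol=1e-8):
    """Fit k (k >= 2) one-dimensional normal distributions with plain EM.

    Returns the fitted means and, for each value, the index of the
    distribution it fits best.
    """
    lo, hi = min(values), max(values)
    # Deterministic start: means spread evenly between the min and max value.
    means = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    overall_mean = sum(values) / len(values)
    overall_var = sum((v - overall_mean) ** 2 for v in values) / len(values)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        # E-step: responsibility of each distribution for each value.
        resp = []
        for v in values:
            dens = [w * normal_pdf(v, m, s)
                    for w, m, s in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the weight, mean and variance of each distribution.
        new_means = []
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mean = sum(r[j] * v for r, v in zip(resp, values)) / nj
            var = sum(r[j] * (v - mean) ** 2 for r, v in zip(resp, values)) / nj
            weights[j] = nj / len(values)
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed distribution
            new_means.append(mean)
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:  # convergence reached
            break
    # Map each value to the distribution with the highest responsibility.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels

# Hypothetical expression values forming two well-separated groups.
values = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, labels = em_1d(values, k=2)
print([round(m, 2) for m in means], labels)
```

The variance floor is a simple safeguard added for this sketch; it plays no part in the filters described above.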

Location and expression combined probability

The graph model is a useful tool for combining information about two facets of gene behaviour (e.g. the gene function and its expression profile); however, it is limited to only two items of information. When studying gene behaviour we would often wish to go beyond two such factors (e.g. combining information about gene location, codon usage, proposed gene function and a number of sets of expression experiments). One mechanism that will support this is probabilities (and networks of probabilities). Such a mechanism has been used to build an example analysis method, which defines both the distance between genes on a chromosome and gene expression similarity as probabilities; these probabilities are then combined to identify sets of genes with similar behaviour.

You can select this method by using the Tools->Clusters->By Heuristic->Use location and expression profiles... menu.

By Distance Measure

Depending on the experiment type, different distance methods should be used: for SAGE data a similarity (Euclidean distance) search which is both anchored and seeded with 'real values' is generally suitable; for gene chip data a slope (gradient) search which is not seeded from real values can be used. Overall SeqExpress provides seven mechanisms for clustering genes into non-overlapping sets.

Euclidean or Manhattan distance measure - finds genes whose expression profiles have similar values (the Euclidean or Manhattan distance is small).
Peaks - finds genes that occur in unusually high/low expression profiles (peaks).
KMeans - finds genes using KMeans clustering.
Pearson or cosine distance measure - finds genes that have expression profiles of a similar shape.
Gradient distance - this is similar to a cosine distance measure, but it assumes the order of the experiments is important and uniform (e.g. a time series); it finds genes whose rates of change (gradients) are similar.
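The difference between the value-based and shape-based measures above is easiest to see on a small example. The functions below are textbook definitions written for illustration. The gradient distance in particular is an assumption (the Euclidean distance between successive differences), as the exact formula used by SeqExpress is not given here.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; undefined (division by zero) for all-zero profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def pearson_distance(a, b):
    """1 - Pearson correlation, computed as the cosine distance of centred profiles."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])

def gradient_distance(a, b):
    """Hypothetical definition: Euclidean distance between the successive
    differences (slopes) of two profiles whose experiments are ordered."""
    slopes_a = [y - x for x, y in zip(a, a[1:])]
    slopes_b = [y - x for x, y in zip(b, b[1:])]
    return euclidean(slopes_a, slopes_b)

# Two hypothetical profiles with the same shape but different levels.
low = [1.0, 2.0, 3.0, 4.0]
high = [11.0, 12.0, 13.0, 14.0]

print(euclidean(low, high))          # 20.0: far apart in value
print(manhattan(low, high))          # 40.0
print(round(pearson_distance(low, high), 6))
print(gradient_distance(low, high))  # 0.0: identical rates of change
```

The two profiles are far apart under the Euclidean and Manhattan measures, yet effectively identical under the Pearson and gradient measures, which is why the choice of measure depends on the experiment type.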

For some of the above mechanisms a number of different customizable options are available including:

the number of starting clusters gives an indication of cluster size; during iterations this number may be decreased if a cluster falls below a user-defined size.
the centroid for the cluster can be anchored to the most similar gene within a cluster.
the starting clusters can either be chosen from the set of real values or randomly generated.

The clustering is started by selecting Tools->Find Clusters->Using distance measures. The following options are available to control both the starting clusters and the cluster refinement iterations:

Number of clusters: This specifies the starting number of clusters which will be used to seed the algorithm.
Use best guess: This specifies that SeqExpress should define the number of starting clusters, at present this will find larger clusters.
Seed From Real Values: If selected, the starting points for the clusters will be chosen from the set of expression profiles. This is a useful technique when dealing with non-parametric data (such as SAGE), as the distribution of starting clusters more accurately reflects the underlying data. If not selected, the initial centroids will be randomly distributed between the maximum and minimum values of each expression experiment.
Minimum size: This specifies the minimum size that a cluster should be. If during an iteration a cluster has fewer than this number of genes, the cluster is removed.
Anchor clusters: If selected then the centroids for a cluster are adjusted at each iteration to match the closest matching gene in the cluster. Such an anchoring process keeps the definition for the cluster closer to that of real expression profiles, and can help alleviate problems associated with outliers.

The refinement process is repeated until convergence has been reached. The results are then added to the gene list on the main page.
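To show how the options above interact, here is a minimal K-means style sketch with real-value seeding, anchoring and a minimum cluster size. It is written under several assumptions and is not the SeqExpress implementation: the function and parameter names are hypothetical, Euclidean distance is hard-coded, and convergence is taken to mean that cluster membership no longer changes.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_clusters(profiles, n_clusters, seed_from_real=True, anchor=True,
                  min_size=1, max_iter=100, rng=None):
    """K-means style refinement over a dict of gene name -> expression profile."""
    rng = rng or random.Random(0)
    min_size = max(min_size, 1)  # empty clusters are always removed
    genes = list(profiles)
    if seed_from_real:
        # Seed from real values: start from actual expression profiles.
        centroids = [list(profiles[g]) for g in rng.sample(genes, n_clusters)]
    else:
        # Otherwise draw each coordinate between that experiment's min and max.
        columns = list(zip(*profiles.values()))
        centroids = [[rng.uniform(min(c), max(c)) for c in columns]
                     for _ in range(n_clusters)]
    groups, previous = [], None
    for _ in range(max_iter):
        # Assign every gene to its nearest centroid.
        groups = [[] for _ in centroids]
        for g in genes:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(profiles[g], centroids[i]))
            groups[nearest].append(g)
        # Minimum size: remove clusters that are too small.
        groups = [grp for grp in groups if len(grp) >= min_size]
        if groups == previous or not groups:  # converged, or nothing left
            break
        previous = groups
        centroids = []
        for grp in groups:
            mean = [sum(col) / len(grp)
                    for col in zip(*(profiles[g] for g in grp))]
            if anchor:
                # Anchor: snap the centroid to the member closest to the mean.
                closest = min(grp, key=lambda g: euclidean(profiles[g], mean))
                mean = list(profiles[closest])
            centroids.append(mean)
    return groups

# Hypothetical profiles (two experiments per gene).
profiles = {
    "g1": [1.0, 1.0], "g2": [1.2, 0.9], "g3": [0.9, 1.1],
    "g4": [8.0, 8.0], "g5": [8.1, 7.9], "g6": [7.9, 8.2],
}
clusters = find_clusters(profiles, n_clusters=2)
print(clusters)
```

Genes from a removed cluster are reassigned to the surviving centroids on the next iteration. As with any K-means style method, the result depends on the starting clusters, so different seeds can give different partitions.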