To aid the development and validation of the techniques used in SeqExpress, a number of test cases were designed and evaluated. These test cases were designed to ensure that the results of the techniques used in SeqExpress could be used to discover biologically relevant information from gene expression data.
The documents below provide information on two such cases, one for SAGE data and one for gene chip data. The SAGE data experiment involved assigning putative function to unknown SAGE tags using functional enrichment of clusters. The gene chip data experiment involved discovering regulatory modules in yeast data sets using unsupervised learning.
A PowerPoint presentation introducing some aspects of SeqExpress is available.
Assigning putative functionality to SAGE tags using functional enrichment of clusters.
Because genes associated with parts of the cellular machinery can be collectively controlled, it is theoretically possible to assign phenotypic function to certain genes by analysing their expression profiles. However, the multi-faceted nature of genetic function, and the complexities of the signal and noise interactions that are inherent in a cell, mean that whilst illustrations of such mediated control (for example, the repression of translational machinery) have been well studied, it is difficult to use them to reliably ascertain genetic function for individual genes. In this test case a semi-discrete decomposition hierarchy was generated, and the Gene Ontology was then used by SeqExpress to classify and compare clusters from different sets of SAGE samples, in order to discover which tags have a strong predisposition to reside in a particular type, or types, of cluster. This predisposition was then used to resolve ambiguities in tag assignment that arise when sequence homology is used to predict protein products.
Details of the analysis and results are available as a PDF.
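As an illustration of what "functional enrichment of clusters" can mean in practice, the sketch below scores one cluster against one Gene Ontology term with a one-sided hypergeometric test. This is a common choice for enrichment analysis, but it is an assumption here: the statistic SeqExpress actually uses is not stated on this page, and the function name and example counts are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, annotated_in_cluster):
    """Probability of seeing at least `annotated_in_cluster` annotated tags
    in a cluster of `cluster_size` tags drawn at random from `population`
    tags, of which `annotated` carry the GO term (hypergeometric upper tail).
    Hypothetical helper, not part of SeqExpress."""
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(annotated_in_cluster, upper + 1)
    )
    return tail / total

# Made-up counts: 2000 tags in total, 40 annotated with the term,
# a cluster of 50 tags of which 6 carry the term.
p = enrichment_p_value(2000, 40, 50, 6)
print(f"enrichment p-value: {p:.3g}")
```

A small p-value suggests that the term is over-represented in the cluster. When many terms and clusters are tested, some correction for multiple testing would normally be applied as well.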
Discovering regulatory modules using unsupervised learning.
Regulatory modules and gene networks are starting to be identified using a combination of sequence and expression information, mainly involving supervised learning. Combining consensus sequence and expression information to determine correlations between regulatory patterns and regulatory binding sites is a powerful technique; however, such correlations are only one of a myriad of factors that can be used to interpret the experimental observations (others include RNA degradation, translational pauses, chromosome location, cell-cycle variability, and our incomplete understanding). The purpose of this test case was to discover to what extent such regulatory modules can be determined using unsupervised clustering. That is to say, is it possible to predict a portion of the underlying regulatory mechanisms that define the behaviour of a cell by simply examining the geometry of the differing RNA levels that are elucidated through expression assay experiments?
Being able to cluster genes automatically into different regulatory groups would enable greater understanding of a series of expression experiments, and may lead to a greater understanding of the underlying protein production processes. If the gene regulatory patterns are to be modeled, then it is important to discover which model fits the data, rather than how to fit the data to the model. For this reason the clustering techniques must be robust and model-defined, and should not rely on any specific transformation procedures. In this test case a number of techniques were used, including a modified probabilistic model which attempts to heat and cool the mixture of models until no further improvement is observed. This was done to obtain a better fit of the residual to the defined models. The method used in SeqExpress follows this procedure:
1. Set B <- 1 and generate a series of models by performing EM until the sum of the minimum residuals fails to decrease.
2. Set B <- B*n where 1>n>0.
3. Generate a series of models by performing EM at this lower B level until the sum of minimum residuals fails to decrease.
4. With B <- 1, if the models generated in step 3 have a lower minimum residual, try to refine the models at a lower level of B (repeat step 2); otherwise perform step 5.
5. Set B <- B/n where 1>n>0.
6. Generate a series of models by performing EM at this higher B level until the sum of minimum residuals fails to decrease.
7. With B <- 1, if the models generated in step 6 have a lower minimum residual, try to refine the models at a higher level of B (repeat step 5); otherwise perform step 8.
8. With B <- 1, select the models with the lowest minimum residual sum from the set generated in steps {1, 4, 6}.
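The heating and cooling schedule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SeqExpress implementation: the models are one-dimensional, unit-variance, equally weighted Gaussians; "heating" is modelled by raising the likelihood to the power B in the E-step; each EM run at a new B starts from the best models found so far; and the residual of a point is its squared distance to the closest model mean. All function names are hypothetical.

```python
import math

def min_residual_sum(points, means):
    """Sum over points of the squared distance to the closest model mean."""
    return sum(min((x - m) ** 2 for m in means) for x in points)

def em_at_beta(points, means, beta, max_iter=200):
    """Tempered EM at a fixed beta (B), run until the sum of minimum
    residuals fails to decrease. Returns (means, residual sum)."""
    best = min_residual_sum(points, means)
    for _ in range(max_iter):
        sums = [0.0] * len(means)
        weights = [0.0] * len(means)
        # E-step: responsibilities proportional to likelihood ** beta.
        for x in points:
            logs = [-0.5 * beta * (x - m) ** 2 for m in means]
            top = max(logs)
            resp = [math.exp(v - top) for v in logs]
            total = sum(resp)
            for k, r in enumerate(resp):
                weights[k] += r / total
                sums[k] += x * r / total
        # M-step: responsibility-weighted means (keep a mean with no support).
        new_means = [s / w if w > 0 else m
                     for s, w, m in zip(sums, weights, means)]
        score = min_residual_sum(points, new_means)
        if score >= best:
            break
        means, best = new_means, score
    return means, best

def annealed_em(points, init_means, n=0.5, max_rounds=8):
    """Follow the schedule: EM at B = 1, then lower B while that helps,
    then raise B while that helps, and keep the best models found."""
    best_means, best = em_at_beta(points, init_means, 1.0)        # step 1
    for factor in (n, 1.0 / n):       # heat (steps 2-4), then cool (steps 5-7)
        beta = 1.0
        for _ in range(max_rounds):
            beta *= factor
            means, score = em_at_beta(points, best_means, beta)
            # The residual measure used here does not depend on B, so the
            # comparison "with B <- 1" reduces to comparing residual sums.
            if score < best:
                best_means, best = means, score
            else:
                break
    return best_means, best                                       # step 8

points = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
means, residual = annealed_em(points, [1.0, 2.0])
print(sorted(means), residual)
```

In this sketch a change of B is only kept when it lowers the residual sum, so the final models are never worse than those from plain EM at B = 1. A fuller version would fit real expression profiles in many dimensions and estimate variances and mixing weights as well.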
One drawback of this approach is that the regulatory modules we are trying to find are still largely unknown. Because we are effectively verifying against predictions from another computational analysis (albeit a more complete one), it is probable that the success of the predictions reflects their ability to mimic the behaviour of the other clustering technique rather than the actual regulatory modules themselves.
Details of the analysis and results will be available soon.