Improving Novel Gene Discovery in High-Throughput Gene Expression Datasets
Abstract
High-throughput gene expression datasets (including RNA-seq and microarray datasets) quantify the expression levels of tens of thousands of genes in an organism, making it possible to identify putative functions for previously unstudied genes involved in responses to a treatment or condition.
For static (single-timepoint) high-throughput gene expression experiments, the most common first analysis step in novel gene discovery is to filter genes by their degree of differential expression and by the amount of inter-replicate noise. However, this filtering step may remove genes with very high baseline expression levels, as well as genes whose functional annotations are important to the experiment being studied. Chapter 2 presents a new knowledge-based clustering approach for novel gene discovery, in which known functionally important genes and genes with very high expression levels (which would typically be removed by a strict fold change filter) are set aside and retained before the filter is applied.
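The idea of protecting genes before filtering can be illustrated with a minimal sketch. It is not the implementation from Chapter 2: the function name `select_genes`, the thresholds, the pseudocount, the use of the coefficient of variation as the noise measure, and the quantile rule for "very high expression" are all assumptions chosen for illustration.

```python
import numpy as np

def select_genes(control, treated, gene_ids, known_genes,
                 min_abs_log2fc=1.0, max_cv=0.5, high_expr_quantile=0.99):
    """Return a boolean mask of genes kept for downstream clustering.

    control, treated: arrays of shape (n_genes, n_replicates) holding
    non-negative expression values. All thresholds are hypothetical.
    """
    control = np.asarray(control, dtype=float)
    treated = np.asarray(treated, dtype=float)
    pseudo = 1.0  # assumed pseudocount to avoid division by zero

    mean_c = control.mean(axis=1)
    mean_t = treated.mean(axis=1)

    # Degree of differential expression: log2 fold change of group means.
    log2fc = np.log2((mean_t + pseudo) / (mean_c + pseudo))

    # Inter-replicate noise: the larger coefficient of variation of the two groups.
    def cv(x):
        return x.std(axis=1, ddof=1) / (x.mean(axis=1) + pseudo)
    noise = np.maximum(cv(control), cv(treated))

    # Conventional filter: keep strongly changed, low-noise genes.
    passes_filter = (np.abs(log2fc) >= min_abs_log2fc) & (noise <= max_cv)

    # Genes protected from the filter: known functionally important genes
    # and genes with very high expression in either group.
    is_known = np.isin(gene_ids, list(known_genes))
    peak = np.maximum(mean_c, mean_t)
    is_high = peak >= np.quantile(peak, high_expr_quantile)

    return passes_filter | is_known | is_high
```

In this sketch a highly expressed gene with a small fold change, or a gene from the known list, stays in the output even though the conventional filter alone would drop it.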