Qian Zhu will present his Pre-FPO on Friday, April 24, 2015, at 10am in Room 401.

The members of his committee are: Olga Troyanskaya (adviser), Mona Singh, Kai Li, Moses Charikar, and Vessela Kristensen (University of Oslo, Norway).

Title: Exploring gene similarities in large-scale compendia

Public gene expression datasets have been accumulating rapidly, yet there is currently a lack of effective tools for reusing existing datasets and exploring them for targeted analyses. One of the most central analyses is discovering similar genes in a data-driven way through clustering, biclustering, and query-search approaches. This thesis explores these issues and is divided into three parts. In the first part, we describe a fast, content-based, query-driven search solution for discovering gene similarities in large data collections. The second part explores applications of the search tool, including the role of coexpression in regulatory factor discovery and biological network inference when it is integrated with secondary data. Lastly, we extend a recently developed biclustering approach that discovers a low-rank submatrix within a large matrix; we explore its usage and describe a modification that can enumerate all coregulated genes and conditions.

We first present SEEK, a web-based tool and novel algorithm for exploring and visualizing gene expression patterns from thousands of datasets. The system supports flexible multi-gene queries and uses a gene “hubbiness” correction procedure to balance well-coexpressed genes in the results. It uses an automatic, data-driven dataset weighting algorithm to filter out irrelevant datasets and up-rank datasets that may be relevant to the user’s query, on the basis of coexpression of the query genes. This algorithm thus eliminates the need to find relevant datasets manually, which is infeasible at this scale, while also achieving query sensitivity.
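
As a rough illustration of the dataset weighting idea described above, the following Python sketch weights each dataset by how strongly the query genes are coexpressed within it, then ranks the remaining genes by their weighted coexpression with the query. This is not SEEK's actual implementation: the function names (dataset_weight, rank_genes) are hypothetical, Pearson correlation is assumed as the coexpression measure, negative weights are clipped to zero as a simplifying assumption, every dataset is assumed to contain the same genes in the same row order, and the hubbiness correction is omitted.

```python
import numpy as np


def dataset_weight(expr, query_idx):
    """Weight one dataset by the mean pairwise Pearson correlation of the query genes.

    expr is a genes x samples matrix; query_idx lists at least two row indices.
    Negative averages are clipped to zero (an assumption of this sketch).
    """
    corr = np.corrcoef(expr[query_idx])            # query x query correlations
    upper = np.triu_indices(len(query_idx), k=1)   # each distinct pair once
    return max(float(corr[upper].mean()), 0.0)


def rank_genes(datasets, query_idx):
    """Rank non-query genes by dataset-weighted coexpression with the query."""
    weights = np.array([dataset_weight(d, query_idx) for d in datasets])
    if weights.sum() == 0:
        raise ValueError("query genes are not coexpressed in any dataset")

    scores = np.zeros(datasets[0].shape[0])
    for w, expr in zip(weights, datasets):
        corr = np.corrcoef(expr)                   # genes x genes correlations
        # Average correlation of every gene to the query genes, scaled by weight.
        scores += w * corr[:, query_idx].mean(axis=1)
    scores /= weights.sum()

    query = set(query_idx)
    order = [int(g) for g in np.argsort(-scores) if int(g) not in query]
    return order, weights
```

In this sketch, a dataset in which the query genes do not vary together receives a weight near zero and contributes almost nothing to the final ranking, which is one simple way to obtain query-sensitive results without selecting datasets by hand.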
Notably, through robust search of thousands of human datasets, the retrieval of functionally co-annotated genes consistently improves with the inclusion of more datasets, showing the promise of large compendia. We extend the function of SEEK to five other model organisms in a new system called modSEEK, with the goal of enabling accurate searches across a wider variety of experiments.

In terms of applications, we show that SEEK and modSEEK, when integrated with secondary ChIP-seq compendia, are accurate in uncovering upstream transcription factors (TFs) and their relationships. These are summarized in a TF regulatory network, which carries value in generating hypotheses and directing follow-up studies. Regulatory associations between TFs are accurate according to in vitro motif evidence.

Biclustering is a computationally intensive problem. Prior work operates on a binarized version of the matrix, which may incur a small loss of resolution. Finding multiple biclusters usually involves masking the data with random values, which prevents finding overlapping biclusters. This solution is also computationally inefficient, since masking the data and repeating the procedure does not reduce the problem size. We thus develop a recursive, divide-and-conquer approach that extends the recently developed algorithm for finding a low-rank submatrix (see Aaditya V. Rangan et al.). The proposed algorithm applies this algorithm at large scale and finds multiple low-rank biclusters. The approach finds small, high-resolution biclusters and combines them through post-processing. This algorithm is fast and capable of enumerating all biclusters in the dataset.