Computational approaches for the DNA sequencing data deluge
Ben Langmead,
University of Maryland, College Park
Tuesday, March 6, 2012, 4:30pm
Computer Science Small Auditorium (Room 105)
Second-generation DNA sequencers are improving rapidly and are now
capable of producing hundreds of billions of nucleotides of data in
about a week for a few thousand dollars. Consequently, sequencing has
become a common tool in many fields of life science. But with these
developments comes a problem: growth in per-sequencer throughput is
drastically outpacing growth in computer speed. As the throughput gap
widens over time, the crucial research bottlenecks are increasingly
computational: computing, storage, labor, power. Along
these lines, I will discuss collaborative scientific projects in
epigenetics and gene expression profiling for which I provided novel
computational methods in areas such as read alignment, text indexing,
and data-intensive computing. I will also discuss a new set of methods
for very time- and space-efficient alignment of sequencing reads: Bowtie
and Bowtie 2. These tools build on the insight that the
Burrows-Wheeler Transform and the FM Index, previously used for data
compression and exact string matching, can be extended to facilitate
fast and memory-efficient alignment of DNA sequences to long reference
genomes such as the human genome.
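As a rough illustration of the exact-matching idea mentioned above, the
following is a minimal sketch of a Burrows-Wheeler Transform, an FM Index,
and backward search over a short DNA string. It is not the Bowtie or
Bowtie 2 implementation: the function names (bwt_via_suffix_array,
build_fm_index, find_exact) are hypothetical, the suffix array is built by
naive sorting, and the tables are stored uncompressed, so it only suits
toy inputs. It assumes the text does not contain the '$' terminator and
does not handle mismatches or gaps.

```python
def bwt_via_suffix_array(text):
    """Return (bwt, suffix_array) for text + '$', using a naive suffix sort."""
    text += "$"  # '$' sorts before A, C, G, T and marks the end of the text
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character for each sorted suffix is the character preceding it
    # (text[-1] is '$' when the suffix starts at position 0).
    bwt = "".join(text[i - 1] for i in sa)
    return bwt, sa


def build_fm_index(bwt):
    """Build the two FM Index tables from a BWT string.

    first[c]  = number of characters in the text strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt[:i]
    """
    alphabet = sorted(set(bwt))
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ


def find_exact(pattern, first, occ, sa):
    """Return sorted start offsets of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for c in reversed(pattern):  # backward search: last character first
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


reference = "GATTACAGATTACA"  # toy reference, chosen for illustration
bwt, sa = bwt_via_suffix_array(reference)
first, occ = build_fm_index(bwt)
hits = find_exact("TTAC", first, occ, sa)
```

Each step of the backward search narrows a range of suffix array rows
using only table lookups, so the cost of a query grows with the length of
the read rather than the length of the reference.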
Ben Langmead is a Research Associate in the Department of Biostatistics
at the Johns Hopkins Bloomberg School of Public Health. He completed
his Ph.D. in Computer Science in February 2012 at the University of
Maryland, advised by Steven L. Salzberg. His research addresses
problems at the intersection of computer science and genomics, and he is
the author of several open source software tools for analysis of
high-throughput genomics data, including Bowtie, Bowtie 2, Crossbow and
Myrna. His paper describing Bowtie won the Genome Biology award for
outstanding paper published in 2009. At Johns Hopkins, he collaborates
with biostatisticians, biomedical engineers, biologists, and other
computer scientists to develop methods for analyzing second-generation
DNA sequencing data.