Recent advances in RNA sequence analysis
Center for Bioinformatics and Computational Biology and Department of Computer Science, University of Maryland, College Park, MD 20742, USA
The electronic version of this article is the complete one and can be found at: http://f1000.com/reports/b/2/64
The latest high-throughput DNA sequencing technology can now be applied on a large scale to capture the complete set of mRNA transcripts in a cell, using a technique called RNA-seq. Although RNA-seq is only 2 years old, it has rapidly swept through the field of genomics, and it is now being used to analyze the transcriptomes of organisms ranging from bacteria to primates. The depth of sequencing allows researchers to quantify the level of expression of genes, to discover alternative isoforms in eukaryotic species, and even to characterize the operon structure of bacterial genomes.
Introduction and context
Sequencing the mRNA in a cell has been used as a high-throughput method for finding genes since the early days of the human genome project. Beginning in the early 1990s, the expressed sequence tag (EST) method was used to capture fragments of thousands of human genes [1] prior to the sequencing of the genome. EST sequencing relies on the fact that eukaryotic mRNAs are polyadenylated after transcription, so the long poly-A tract can be used to capture the transcripts via reverse transcription PCR (RT-PCR). The EST method was subsequently applied to many other species, and EST databases (notably dbEST) became a vital resource for genome annotation. Recently, a ‘next-gen’ version of EST sequencing has emerged, allowing researchers to capture and sequence mRNA at dramatically lower cost, and higher volume, than was ever possible with the EST method. The new RNA-seq methods [2-5] are being applied to a rapidly growing variety of species, cell types, and scientific questions, revealing far more about the transcriptomes of these species than was known just a few years ago. The field is advancing so rapidly that a brief review cannot cover all the work of the past 2 years; this review samples only a few highlights.
Major recent advances
Sultan et al. [6] analyzed approximately 8 million short reads and found that RNA-seq could detect 25% more genes than microarrays. About one-third of transcripts in their experiments mapped to genomic regions not annotated as genes. Of the 94,241 splice junctions, 4096 were novel, and many of these indicated exon skipping events. This result has been reinforced by subsequent studies that generated even more sequences and showed even larger numbers of novel splicing events. Trapnell et al. [7] generated approximately 430 million paired-end reads from mouse myoblast cells and recovered 13,692 known isoforms, and also detected 12,712 novel isoforms, of which 7395 contained novel splice junctions while the rest represented novel combinations of known exons. This latter study also demonstrated the power of a new algorithm capable of detecting and quantifying alternative isoforms when aligning RNA-seq reads to a genome. In an RNA-seq study using liver RNA samples from humans, chimpanzees, and rhesus macaques, Blekhman et al. [8] found that alternative splicing events vary between closely related primates and also between the sexes within species. Wang et al. [9] generated approximately 600 million short reads from 15 cell types and found that 92-94% of human genes are alternatively spliced, and that many alternative splicing events are tissue-specific. RNA-seq is also being used to study how genetic variation among individuals affects gene expression (expression quantitative trait loci, or eQTLs). Pickrell et al. [10] and Montgomery et al. [11] combined RNA-seq data and HapMap data from 69 Nigerian individuals and 63 Caucasian individuals, respectively, and both groups identified variants responsible for alternative splicing as well as variation in expression levels among individuals.
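As a rough illustration of how novel splice junctions can be distinguished from annotated ones, the following Python sketch compares junctions observed in read alignments with those implied by an annotation. It is not the method used in any of the studies cited above. The function names, the coordinate convention (each junction recorded as the last base of the upstream exon and the first base of the downstream exon, on one strand), and the toy coordinates are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical representation: (chromosome, donor position, acceptor position).
Junction = Tuple[str, int, int]
# Hypothetical annotation record: (chromosome, list of (exon_start, exon_end)).
Transcript = Tuple[str, List[Tuple[int, int]]]


def annotated_junctions(transcripts: Iterable[Transcript]) -> Set[Junction]:
    """Collect the junctions between consecutive exons of each transcript."""
    known: Set[Junction] = set()
    for chrom, exons in transcripts:
        ordered = sorted(exons)  # order exons by start coordinate
        for (_, upstream_end), (downstream_start, _) in zip(ordered, ordered[1:]):
            known.add((chrom, upstream_end, downstream_start))
    return known


def classify_junctions(
    observed: Iterable[Junction], transcripts: Iterable[Transcript]
) -> Dict[str, List[Junction]]:
    """Sort observed junctions into three illustrative categories."""
    known = annotated_junctions(transcripts)
    # Annotated exon boundaries, used to recognize a new pairing of known sites.
    donors = {(chrom, donor) for chrom, donor, _ in known}
    acceptors = {(chrom, acceptor) for chrom, _, acceptor in known}

    result: Dict[str, List[Junction]] = {
        "known": [],
        "novel_combination": [],  # both sites annotated, pairing is new
        "novel_site": [],         # at least one site is not annotated
    }
    for junction in observed:
        chrom, donor, acceptor = junction
        if junction in known:
            result["known"].append(junction)
        elif (chrom, donor) in donors and (chrom, acceptor) in acceptors:
            result["novel_combination"].append(junction)
        else:
            result["novel_site"].append(junction)
    return result


if __name__ == "__main__":
    # Toy annotation: one three-exon transcript (coordinates are made up).
    annotation = [("chr1", [(100, 200), (300, 400), (500, 600)])]
    reads = [("chr1", 200, 300), ("chr1", 200, 500), ("chr1", 200, 350)]
    for category, junctions in classify_junctions(reads, annotation).items():
        print(category, junctions)
```

In the toy data, the junction joining position 200 directly to position 500 pairs two annotated exon boundaries while bypassing the middle exon, which is the pattern expected for an exon skipping event. A real analysis would also have to account for strand, alignment errors, and minimum read support.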
In single-celled organisms, RNA-seq can reveal new insights about polycistronic transcripts. In the first transcriptome analysis of Trypanosoma brucei, thousands of splicing and polyadenylation sites were identified and many genes were found to be differentially expressed between the parasite's two life-cycle stages [12]. In prokaryotes, RNA-seq can provide an extremely detailed transcription map, at the single-base level, as has been shown recently in an archaeal species, Sulfolobus solfataricus, and in a pathogenic bacterium, Helicobacter pylori. In S. solfataricus, over 1000 transcriptional start sites were detected and 80 novel protein-coding genes were discovered [13]. In H. pylori, hundreds of transcriptional start sites within operons were found, as well as approximately 60 novel small RNA genes [14].
The power of RNA-seq stems from its ability to generate deep coverage of the entire transcriptome of a cell with just a single run of a high-throughput sequencer, such as the Illumina HiSeq, which can produce up to 200 billion bases in a single run. The potential to characterize all genes, to capture alternative isoforms, and to measure differential expression has already been demonstrated in dozens of studies, but hundreds of species, and countless experimental conditions, are yet to be explored. Several groups have developed methods besides poly-A selection to capture all RNAs in a cell, for example, random hexamer priming [13,15], which allows them to analyze prokaryotic transcriptomes or to look at noncoding RNA in eukaryotes. RNA-seq now seems likely to replace microarray technology in the coming years, as it appears to be not only more comprehensive but also much more accurate than microarrays, particularly for transcripts with low expression levels [16]. As this new method becomes even more widely adopted, it should greatly expand our understanding of the complex interplay of genes in all phases of cell development.
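To make the idea of measuring expression from read counts concrete, the sketch below computes reads per kilobase of transcript per million mapped reads (RPKM), one widely used normalization that corrects a raw read count for both gene length and sequencing depth. This review does not specify a particular measure, so the choice of RPKM, the function name, and the example numbers are assumptions made purely for illustration.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 10^9 / (gene_length_bp * total_mapped_reads)
    The factor 10^9 combines the per-kilobase (10^3) and per-million (10^6) scaling.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Made-up example: 500 reads on a 2000 bp gene, 10 million mapped reads in total.
    print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Because the value is scaled by the total number of mapped reads, the same gene can be compared across runs of different depth, although comparisons between samples with very different transcriptome composition need more careful normalization.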
The author declares that he has no competing interests.
This work was supported in part by National Institutes of Health grants R01-LM006845 and R01-GM083873.
1. Adams MD, Kerlavage AR, Fields C, Venter JC: 3,400 new expressed sequence tags identify diversity of transcripts in human brain. Nat Genet. 1993, 4:256–67.
2. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5:621–8.
3. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320:1344–9.
4. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell. 2008, 133:523–36.
5. Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5:613–9.
6. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321:956–60.
7. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, Van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28:511–5.
8. Sex-specific and lineage-specific alternative splicing in primates. Genome Res. 2010, 20:180–9.
9. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456:470–6.
10. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464:768–72.
11. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET: Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010, 464:773–7.
12. Genome-wide analysis of mRNA abundance in two life-cycle stages of Trypanosoma brucei and identification of splicing and polyadenylation sites. Nucleic Acids Res. 2010 [Epub ahead of print].
13. A single-base resolution map of an archaeal transcriptome. Genome Res. 2010, 20:133–41.
14. The primary transcriptome of the major human pathogen Helicobacter pylori. Nature. 2010, 464:250–5.
15. Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing. Proc Natl Acad Sci U S A. 2009, 106:3976–81.
16. Most “dark matter” transcripts are associated with known genes. PLoS Biol. 2010, 8:e1000371.