- Shotgun strategies help to sequence whole genomes in small fragments, which are afterwards assembled into longer contigs.
- RADseq strategies provide a reduced but consistent set of sequences of the genome, which are used especially for population genetics.
- Hybrid enrichment describes the specific enrichment of preselected sequences.
- RNA-Seq analyses characterize the sequence content and corresponding expression levels of transcriptomes.
- Technical developments paved the way to sequence genomes and transcriptomes of single cells.
4.1 Shotgun Sequencing
The length of prokaryote and eukaryote
genomes exceeds by far the length of sequence reads produced by
available technologies. Moreover, in the case of eukaryotes, the
genomic information is distributed across a number of chromosomes.
Therefore, different strategies have been developed for complete
genome sequencing. Many of these methods have been explored in the
course of the human genome project, e.g. transposon-based methods
to integrate random insertions into cloned DNA or multiplex PCR
strategies (Green 2001; Church and
Kieffer-Higgins 1988). However,
the most common method is shotgun sequencing, which was developed
in the early 1980s (Anderson 1981;
Gardner et al. 1981). For shotgun
sequencing, a large stretch of DNA is fragmented into smaller
pieces. In the next step, random pieces of the fragmented DNA are
sequenced to generate redundant amounts of sequence data. Finally,
individual sequence reads are assembled to reconstruct the sequence
of the analysed genome (Green 2001). Two different strategies using shotgun
sequencing have been used in genome-sequencing projects (◘ Fig.
4.1): (I)
hierarchical shotgun sequencing and (II) whole-genome shotgun
sequencing.
Fig. 4.1 Overview of shotgun-sequencing methods.
a For hierarchical shotgun
sequencing, large fragments of the original chromosome are cloned
into BAC clones. BAC clones with overlapping fragments are chosen
according to physical mapping information and fragmented into small
fragments. BAC clone fragments are sequenced and assembled for each
clone separately. Assembled contigs are overlapped according to
mapping information to build the final contig. b For whole-genome shotgun sequencing,
chromosomes are fragmented directly, without mapping
information. Fragments are sequenced and reads are
assembled into contigs
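
How much redundant sequence data random sampling requires can be approximated with the idealized Lander-Waterman model, which assumes that reads start uniformly at random along the target. The following sketch is an illustration under that assumption and not part of the methods cited above; the function names and the example values (a 150 Kb insert, 750 bp reads, 10x coverage) are hypothetical.

```python
import math

def coverage(num_reads, read_length, target_size):
    """Average sequencing depth: total sequenced bases divided by target size."""
    return num_reads * read_length / target_size

def expected_uncovered_fraction(depth):
    """Expected fraction of bases hit by no read, assuming uniformly random read starts."""
    return math.exp(-depth)

def reads_needed(target_depth, read_length, target_size):
    """Number of reads required to reach a given average depth."""
    return math.ceil(target_depth * target_size / read_length)

# Hypothetical example: a 150 Kb clone sequenced with 750 bp reads
n = reads_needed(10, 750, 150_000)
print(n, coverage(n, 750, 150_000), expected_uncovered_fraction(10))
```

Real data deviate from this model because of cloning and sequencing biases and repeats, so the estimate is best read as a lower bound on the effort needed.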
For hierarchical shotgun sequencing
(◘ Fig. 4.1a),
large fragments of DNA are cloned using bacterial artificial
chromosomes (BACs). BACs are cloning vectors derived from
Escherichia coli plasmids
and have the advantage that the insertion of relatively large DNA
fragments (>100–300 Kb) is possible (Shizuya et al.
1992). Alternatively, other
cloning systems have been used, but less frequently than BACs. In a
second step, a physical map of the cloned DNA is established.
Various physical mapping approaches have been developed, including
BAC restriction-based fingerprinting (Marra et al. 1997), iterative hybridization (Mozo et al.
1999), and the use of BAC-end
sequences for connecting BAC clones by sequence identity (Mahairas
et al. 1999). Restriction-based
fingerprinting methods digest BAC clones by using a set of
restriction enzymes (e.g. two enzymes in case of double digest),
thereby generating a set of different sized fragments which can be
visualized using gel electrophoresis. For each BAC, a unique
pattern of bands on a gel is derived, and the presence and absence
of fragment sizes can be scored. Finally, all BACs are ordered in
relative position according to their similarity regarding shared
fragment sizes (Soderlund et al. 1997). Based on this information, a minimal set
of overlapping BACs that together completely cover a
selected genomic region (minimal tiling path) is chosen. For
individual sequencing of BACs, their inserted DNA is purified and
physically sheared to generate smaller fragments for sequencing. For
Sanger sequencing, broken ends of the sheared fragments are
enzymatically repaired, and all fragments are size fractionated
using gel electrophoresis. Medium-sized (2–3 Kb) fragments are
selected and cloned into sequencing vectors, which can finally be
sequenced using conserved primer sites in the vector. A random
collection of sequences of ~10x coverage is generated for each BAC,
which can be used for BAC contig assembly. Contigs for all BACs of
the minimal tiling path are overlapped according to the information
from the physical mapping to generate the final sequence, which
should represent the sequenced genomic region. The first available
larger eukaryote genomes, e.g. Arabidopsis thaliana (The Arabidopsis
Genome Initiative 2000) and
Caenorhabditis elegans (The
C. elegans Sequencing
Consortium 1998), have been
sequenced with this approach. The International Human Genome
Sequencing Consortium (2001) used
hierarchical shotgun sequencing for the human genome.
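
The idea of ordering BACs by shared fragment sizes can be sketched as follows. This is a deliberately simplified stand-in for published fingerprinting software: it counts bands that match within a relative size tolerance using greedy matching. The function name, the 2% tolerance and the fragment sizes are assumptions chosen for illustration.

```python
def shared_bands(fingerprint_a, fingerprint_b, tolerance=0.02):
    """Count bands of A that match a band of B within a relative size tolerance.

    Each band of B can be matched only once (greedy matching on sorted sizes).
    """
    remaining = sorted(fingerprint_b)
    shared = 0
    for size in sorted(fingerprint_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[i]  # this band of B is now used up
                break
    return shared

# Hypothetical fragment sizes (bp) read from a gel for three BAC clones
bac1 = [1200, 3400, 5600, 8000]
bac2 = [3410, 5590, 8000, 9500]
bac3 = [700, 9480, 11000]

print(shared_bands(bac1, bac2))  # many shared bands: likely overlapping clones
print(shared_bands(bac2, bac3))  # few shared bands: weak evidence of overlap
print(shared_bands(bac1, bac3))  # no shared bands: no evidence of overlap
```

Clones sharing many bands are placed next to each other, and chaining such pairwise comparisons yields the relative order from which a minimal tiling path is picked.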
Whole-genome shotgun (wgs) sequencing
(◘ Fig. 4.1b)
directly involves sequencing of sheared genomic DNA, thereby
leaving out the time-consuming step of establishing a physical map
(Green 2001). When Sanger sequencing is used,
sheared DNA is end repaired, subcloned into
sequencing vectors and sequenced at high coverage. Assembly of
this kind of sequence data usually leads to less continuous
contigs, as the topological information from physical mapping is
missing. Initially, this approach was mainly used for (small)
bacterial genomes. Weber and Myers (1997) used simulations to demonstrate the
practicability of wgs for sequencing large eukaryote genomes. Most
famously, this was validated in practice by Craig Venter and
colleagues by sequencing and assembling the human genome using wgs
data (Venter et al. 2001).
Next-generation sequencing (NGS)
techniques dramatically increased the output of sequencing reads,
and wgs approaches became a standard. However, the most powerful
methods in terms of sequence reads output (Illumina, Ion Torrent)
are also the methods producing the shortest reads (100–250 bp).
Especially the assembly of eukaryote genomes, which are often rich
in repetitive sequences, became a major challenge. One strategy to
provide extra information for assembling wgs data is the use of
mate-pair sequencing. Mate pairs describe the sequenced ends of DNA
fragments separated by a specific size. For example, if the ends of
a 3 Kb fragment are sequenced, the topological information that
these sequences should be separated by roughly this size can be
used to improve assemblies. Mate pair libraries have been developed
for all major short-read sequencing techniques (Illumina, 454, Ion
Torrent), and even though details may vary, the principle remains
the same. Most frequently, mate-pair sequencing is conducted with
Illumina, and therefore details are explained for this
method.
In the first step, genomic DNA is
sheared into fragments of the desired size (◘ Fig. 4.2a). Typical sizes for
mate pair libraries range from 2 to 5 Kb, even though larger
libraries (5 to 25 Kb) are also feasible (van Heesch et al.
2013). DNA fragments are end
repaired and the 3′-ends are labelled with biotin (◘ Fig.
4.2b). The
B-vitamin biotin is widely used in molecular biology and can be
covalently attached to proteins or nucleic acids. Biotin binds very
rapidly and with high specificity to streptavidin. Magnetic beads
covered with this protein can be used to specifically enrich
biotinylated molecules. The size of prepared fragments can be
selected using agarose gel electrophoresis, and size information is
essential for subsequent computational analysis. Biotinylated
fragments are circularized by intramolecular ligation (◘ Fig.
4.2c), and
remaining linear molecules are enzymatically removed. The
circularized DNA molecules are sheared again into a size of ~500 bp
(◘ Fig. 4.2d).
The fragments containing the biotinylated ends are selected using
streptavidin-covered magnetic beads (◘ Fig. 4.2e), and remaining
fragments are washed away. The selected fragments contain the
3′-ends of the original DNA fragments. Finally, sequencing adaptors
are attached to the selected fragments to prepare the sequencing
library (◘ Fig. 4.2f). Sequencing of these fragments generates
read pairs which align towards the ends of the original
size-selected fragment and are outward facing from each other. The
gap between these reads is approximately the size of the
original fragment, and this information is valuable for contig
assembly and scaffolding of genomes (Chaisson et al. 2009).
Fig. 4.2 Construction of mate-pair libraries
(Illumina). a DNA is sheared
into fragments. b DNA
fragments are end repaired and biotinylated. c Biotinylated fragments are
circularized. d Circularized
DNA molecules are sheared into ~500 bp fragments. e Fragments containing biotinylated ends
are selected using streptavidin-covered magnetic beads; remaining
fragments are washed away. f
Adaptors for sequencing are ligated to selected fragments
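
How the known fragment size of a mate-pair library helps scaffolding can be shown with a small calculation. In this minimal sketch, one read of each pair maps to contig A and its mate to contig B. It is assumed that the reads have already been mapped and re-oriented so that each pair brackets the gap, and that contig A precedes contig B. All names, coordinates and sizes are hypothetical.

```python
from statistics import median

def gap_from_pair(insert_size, contig_a_length, read_a_start, read_b_end):
    """Gap implied by one pair: fragment size minus the bases lying on each contig."""
    bases_on_a = contig_a_length - read_a_start  # from read start to the end of contig A
    bases_on_b = read_b_end                      # from the start of contig B to the mate's end
    return insert_size - bases_on_a - bases_on_b

def estimate_gap(pairs, insert_size, contig_a_length):
    """Median gap over all pairs linking the two contigs (robust to outliers)."""
    gaps = [gap_from_pair(insert_size, contig_a_length, a, b) for a, b in pairs]
    return median(gaps)

# Hypothetical 3 Kb library; contig A is 10,000 bp long
linking_pairs = [(9000, 500), (9200, 750), (8900, 380)]
print(estimate_gap(linking_pairs, 3000, 10_000))
```

Because size selection on a gel is imprecise, each pair gives only a rough estimate, which is why several linking pairs are combined here.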
Mapping strategies have been developed
to improve and validate wgs assemblies, e.g. optical mapping
(Schwarz et al. 2014). This method
is similar to the restriction-based fingerprinting approach
described above. For optical mapping, large DNA molecules are
immobilized on a surface and digested with one or more restriction
enzymes (◘ Fig. 4.3). The digested DNA molecules are stained
with a fluorescent dye. The length between adjoining cut sites is
estimated by measuring the fluorescence intensity. Mapping data of
each single DNA molecule is used to produce a consensus genomic
optical map, which includes an ordered series of DNA fragment sizes
(Mendelowitz and Pop 2014).
Recently, a high-throughput method of optical mapping using
nanochannels has been proposed (Lam et al. 2012). With this approach, DNA fragments are
nicked by an enzyme at specific sequence sites and subsequently
fluorescently labelled. With the help of an electric field,
molecules are driven through a nanoscale channel, where the DNA is
stretched. In this channel, distances between fluorescent labels
can be measured using a microscope. A unique optical pattern
resembling a barcode is created by the measured distances between the
labels (Michaeli and Ebenstein 2012).
Fig. 4.3 A workflow for optical mapping (By Fong
Chun Chan and Kendric Wang (Own work) [CC BY 3.0 (► http://creativecommons.org/licenses/by/3.0)],
via Wikimedia Commons)
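
To compare an assembled contig with an optical map, the contig can be digested in silico, which gives the ordered series of fragment sizes that the map should contain. The sketch below is a minimal illustration and not the procedure of any cited tool: the function name is hypothetical, only the given strand is searched (sufficient for palindromic motifs), and the toy sequence is invented.

```python
def in_silico_map(sequence, motif, cut_offset=0):
    """Return the ordered fragment lengths after cutting at every motif occurrence.

    cut_offset is the position within the motif where the enzyme cuts.
    """
    sequence, motif = sequence.upper(), motif.upper()
    cuts = []
    pos = sequence.find(motif)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(motif, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    # Drop zero-length fragments that arise if a cut falls on a sequence end
    return [end - start for start, end in zip(boundaries, boundaries[1:]) if end > start]

# Toy contig with two GAATTC sites; the cut is assumed to fall after the first base
contig = "AAAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
print(in_silico_map(contig, "GAATTC", cut_offset=1))
```

A consistent assembly should produce a series of sizes that matches a stretch of the consensus optical map within measurement error.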
A mapping strategy that recently became
popular has been commercialized by Dovetail Genomics and
is based on a Hi-C approach (Lieberman-Aiden et al. 2009). The idea behind Hi-C is that, after
fixation of chromatin structure, DNA segments which are in close
proximity in the nucleus are more likely to be ligated together.
This is reflected by the finding that the number of
intra-chromosomal ligation pairs decreases as the genomic
distance between them increases. With the so-called cHiCago
protocol, Hi-C mapping is used for the localization of chromatin
interactions to infer the relative order and orientation of contigs
(Putnam et al. 2016). Using this
protocol, chromatin is reconstituted in vitro and fixed with
formaldehyde. The fixed chromatin is then cut with a restriction
enzyme, thereby generating free sticky ends, which are filled with
biotinylated and thiolated nucleotides. In the next step, free
blunt ends are ligated, and chromatin crosslinks are reversed to generate
ligation mate pairs, which are fusions of fragments that are
distantly located in the genome. After library preparation, these
fragments can be sequenced with NGS methods. The mapping of these
fragments helps to dramatically improve genome assemblies based on
various NGS techniques (e.g. Illumina, PacBio). For example, by
using the cHiCago protocol, the scaffold N50 of the Illumina-based
genome assembly of the American alligator could be increased from
508 Kb to 10 Mb (Putnam et al. 2016) (◘ Fig. 4.4).
Fig. 4.4 Diagram of the cHiCago library preparation
protocol as used by Dovetail Genomics. a Chromatin (nucleosomes in blue) is
reconstituted in vitro upon naked DNA (black strand). b Fixation of chromatin by formaldehyde.
Red lines indicate
crosslinks. c Cutting of
fixed chromatin using restriction enzymes. d Filling of sticky ends with
biotinylated (blue circles)
and thiolated (green
squares) nucleotides. e Ligation of free blunt ends
(red asterisks).
f Fragments for library
preparation are obtained by reversal of crosslinks and removal of
proteins. Terminal biotinylated nucleotides are removed (Reprinted
from Putnam et al. (2016))
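
The scaffold N50 used above to quantify assembly improvement is the length of the shortest scaffold among the longest scaffolds that together contain at least half of the assembled bases. A minimal sketch of this calculation, with a hypothetical function name and toy lengths:

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no scaffold lengths given")

# Toy assembly: total length 54, so half is 27; 10 + 9 + 8 reaches 27
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Joining contigs into longer scaffolds raises this value even when no new sequence is added, which is why it reflects contiguity rather than completeness.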
A different way to improve wgs
assemblies is by using long sequencing reads. This can be directly
done by sequencing with third-generation techniques such as
single-molecule real-time sequencing or nanopore sequencing.
Alternatively, long reads can also be generated synthetically for
Illumina short-read sequencing. Illumina itself distributes a
technique called TruSeq Synthetic Long-Read, which was formerly known under the name
Moleculo. With this approach, ~10 Kb DNA fragments are amplified
and barcoded before sequencing, and long reads can be created
afterwards based on this information. The company 10X Genomics
released an instrument called Chromium which uses a similar but
more powerful approach for the generation of synthetic long reads.
Up to 100 Kb long DNA fragments are amplified and barcoded with an
emulsion PCR step. Subsequently, these fragments are sequenced in a
very low coverage, and sequenced barcodes localize clouds of short
reads which are used to scaffold de novo assemblies (Lee et al.
2016). The advantage of both these
methods is their considerably lower price compared to true
long-read sequencing. However, synthetically generated «long» reads
are prone to biases of the Illumina technology, e.g. reduced or no
coverage in regions with high GC content. Also, tandem repeats are
still difficult to tackle with this approach.
4.2 RADseq
Due to the advent of NGS techniques,
genome sequencing became feasible and affordable even for non-model
organisms and also smaller laboratories. However, for many studies,
it is sufficient to analyse only a subset of the genome, but in a
large number of individuals. A set of related methods used to
sequence a reduced, but consistent representation of the genome is
known as restriction site-associated DNA sequencing (RADseq).
Applications of RADseq include discovery of genetic markers for
phylogenetics and population genetics (Cruaud et al. 2014; Davey et al. 2011), mapping of quantitative trait loci (QTLs)
(Houston et al. 2012), linkage
mapping (Gonen et al. 2014) or
local genome assembly (Etter et al. 2011). The name RADseq was introduced for one
specific approach of reduced representation sequencing (Baird et
al. 2008), but is now used to
describe several similar methods (Andrews et al. 2016). Besides the original RADseq approach, this
family includes methods like ddRAD (Peterson et al. 2012), ezRAD (Toonen et al. 2013), 2bRAD (Wang et al. 2012), and the widely used genotyping by
sequencing (GBS) (Elshire et al. 2011).
The original RADseq protocol starts
with the digestion of genomic DNA with one restriction enzyme
(◘ Fig. 4.5a).
Restriction enzymes are able to cleave DNA in either random (type
I) or specific positions (type II). The first restriction enzyme
cutting a specific sequence motif (HindII) was isolated from the bacterium
Haemophilus influenzae
(Smith and Wilcox 1970). Since
that time, several thousand restriction enzymes (targeting
different sequence motifs) have been described, and hundreds are
commercially available. A list of available restriction enzymes and
their properties is collected in the database REBASE (Roberts et
al. 2015). The choice of the
restriction enzyme greatly influences into how many pieces the genome
is cut. As a rule of thumb, the longer the recognized sequence
motif, the fewer fragments are generated. For example, a six-base
pair motif as recognized by the EcoRI enzyme (◘ Fig. 4.5a) will cut on average every ~4,000
bp, whereas an eight-base pair motif would only cut every ~65,500
bp (Andrews et al. 2016). These
numbers are rough estimates and are greatly influenced by the base
composition of the investigated genome. Restriction enzymes can
either cut symmetrically, thereby generating blunt ends, or
asymmetrically. By using an asymmetrical cutting enzyme, all
fragments will bear so-called sticky ends, which describe the
overhang created by cutting with the restriction enzyme. An adaptor
can be ligated to these sticky ends, which includes a known primer
site for PCR amplification (◘ Fig. 4.5b). If adaptors bearing
unique barcode sequences are used, multiple libraries can be mixed
at this point (multiplexing). This barcode will be read during
sequencing and allows the separation of multiplexed samples. The
complete DNA library will be sheared, followed by repair of
fragment ends. Using blunt-end ligation, a second adaptor is
ligated to all fragments (◘ Fig. 4.5c). This second adaptor is Y-shaped,
as its two strands overlap only partially. The resulting
DNA library will be amplified using a primer pair (e.g. P1 and P2)
(◘ Fig. 4.5d).
One sequencing primer site is nested in the first adaptor (P1). The
second primer site is identical to one of the nonoverlapping
sequence parts of the Y-shaped adaptors (P2). The Y-shaped adaptor
is completed when fragments containing the first adaptor are bound
by P1 and copied. Primer P2 only binds to the Y-shaped adaptor
after completion. Thereby, specificity of the amplification is
enhanced, as only fragments containing both adaptors are amplified
(◘ Fig. 4.5d).
The amplified library can be sequenced using NGS. With this method,
thousands of single nucleotide polymorphism (SNP) loci can be
identified (Davey et al. 2011).
Fig. 4.5 Workflow of the original RADseq protocol.
a Genomic DNA is cut with a
chosen restriction enzyme (in this example EcoRI) for fragmentation. b Using the overhang created by the
restriction enzyme, an adaptor is ligated to sequence fragments.
The complete pool of DNA is sheared mechanically. c Y-shaped adaptors are ligated to the
sheared pool of DNA fragments. d Using priming sites in both adaptors,
the DNA library is amplified. Only fragments containing both
adaptors can be successfully amplified
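
The rule of thumb relating motif length to cutting frequency follows from simple probability: if bases were independent and equally frequent, a motif of n bases would occur on average once every 4^n bp. The sketch below makes this assumption explicit and adds a crude correction for GC content. The function names and the example genome size are hypothetical, and real genomes deviate from this model.

```python
def expected_site_spacing(motif, gc_content=0.5):
    """Expected distance (bp) between occurrences of an unambiguous motif.

    Assumes independent bases with G and C each at gc_content / 2
    and A and T each at (1 - gc_content) / 2.
    """
    p_gc = gc_content / 2
    p_at = (1 - gc_content) / 2
    probability = 1.0
    for base in motif.upper():
        probability *= p_gc if base in "GC" else p_at
    return 1 / probability

def expected_fragments(genome_size, motif, gc_content=0.5):
    """Rough number of fragments produced by a single-enzyme digest."""
    return genome_size / expected_site_spacing(motif, gc_content)

print(expected_site_spacing("GAATTC"))        # six-base motif, 4**6
print(expected_site_spacing("GCGGCCGC"))      # eight-base motif, 4**8
print(expected_site_spacing("GAATTC", 0.4))   # AT-rich genome: motif is more frequent
print(expected_fragments(500_000_000, "GAATTC"))
```

The third call illustrates why base composition matters: an AT-rich motif is cut more often in an AT-rich genome than the uniform estimate suggests.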
Several variants of the original
RADseq protocol have been developed (see above), which differ in
details of restriction enzyme digestion, size selection or adaptor
ligation (Andrews et al. 2016).
Commonly used alternative protocols are ddRAD and GBS. In the case
of double digest RADseq (ddRAD), two different restriction enzymes
are utilized to digest the genomic DNA (Peterson et al.
2012). Adaptors are ligated to
each cut site, and size selection is performed by choosing those
fragments that are flanked by restriction enzyme recognition
sites that are neither too close nor too distant (◘ Fig.
4.6). Using
this method, all reads of a given locus share the same fragment
size, as no shearing step is involved. Moreover, size selection
further decreases the number of analysed loci, which in turn
increases the coverage in terms of sequence reads. In contrast, in
the case of RADseq (see above), each sequenced fragment has a cut
site at one end and a randomly sheared end at the other. Thereby a
range of fragment sizes is produced for each locus (Andrews et al.
2016).
Fig. 4.6 Comparison of analysed loci by RADseq
a and ddRADseq b. In the case of ddRADseq, b size selection excludes regions flanked
by either [a] very close or
[b] very distant
restriction enzyme recognition sites (Figure from Peterson et al.
(2012))
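
The effect of double digestion and size selection can be explored on a known sequence. The following sketch is a simplified illustration, not the published ddRAD pipeline: cut positions are approximated by the start of each recognition motif, only the given strand is searched, and it is assumed that only fragments flanked by one site of each enzyme are kept. Function names, motifs and the toy sequence are hypothetical.

```python
def find_sites(sequence, motif):
    """Start positions of all (possibly overlapping) motif occurrences."""
    sites = []
    pos = sequence.find(motif)
    while pos != -1:
        sites.append(pos)
        pos = sequence.find(motif, pos + 1)
    return sites

def ddrad_fragments(sequence, motif_a, motif_b, min_size, max_size):
    """Fragments between adjacent cut sites of different enzymes within a size window."""
    sequence = sequence.upper()
    cuts = sorted(
        [(p, "A") for p in find_sites(sequence, motif_a)]
        + [(p, "B") for p in find_sites(sequence, motif_b)]
    )
    selected = []
    for (start, enzyme_1), (end, enzyme_2) in zip(cuts, cuts[1:]):
        size = end - start
        if enzyme_1 != enzyme_2 and min_size <= size <= max_size:
            selected.append((start, end))
    return selected

# Toy sequence with sites spaced 100 bp, 24 bp and 506 bp apart
toy = "GAATTC" + "A" * 94 + "CCGG" + "A" * 20 + "GAATTC" + "A" * 500 + "CCGG"
print(ddrad_fragments(toy, "GAATTC", "CCGG", min_size=50, max_size=300))
```

In the toy example only the 100 bp fragment survives: the 24 bp fragment is too short and the 506 bp fragment too long for the chosen window.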
GBS is basically a simplified protocol
of the RADseq approaches described above. DNA is digested with one
restriction enzyme, and a pair of adaptors is ligated to each
fragment. One adaptor contains a barcode unique for each library
(e.g. for single individuals); the other adaptor is a common
adaptor used in all libraries (Elshire et al. 2011). Subsequently, all libraries are pooled
and a PCR is performed with primer sites nesting in the ligated
adaptors. The pooled and amplified library can be sequenced using
NGS. Modifications of this simple protocol, using two restriction
enzymes and Y-shaped adaptors, have been published (Poland et al.
2012). GBS approaches have been
especially widely used for SNP discovery in large plant genomes
(Deschamps et al. 2012), but also for
population genomic analyses (Friis et al. 2016).
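
Because each library carries its own barcode, reads from a pooled run can be sorted back to their source after sequencing. The sketch below shows the principle only: it assumes that the barcode sits at the very start of each read, that barcodes are matched exactly and that no barcode is a prefix of another. Names, barcodes and reads are invented for illustration.

```python
def demultiplex(reads, barcodes):
    """Assign reads to samples by their leading barcode and trim the barcode.

    barcodes maps barcode sequence -> sample name; unassigned reads go under None.
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    by_sample[None] = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:  # no barcode matched this read
            by_sample[None].append(read)
    return by_sample

barcodes = {"ACGT": "individual_1", "TGCA": "individual_2"}
reads = ["ACGTAATTCGG", "TGCAAATTCCC", "GGGGAATTCAA"]
print(demultiplex(reads, barcodes))
```

Production tools additionally tolerate sequencing errors in the barcode, which is why barcode sets are usually designed to differ at several positions.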
The number of loci identified by
RADseq methods is influenced by the frequency of cut sites of the
chosen restriction enzymes, size selection (if applied), genome
size of the target organism and chosen RADseq method. If a
reference genome is available, in silico analyses can be performed
to optimize RADseq experiments (Lepais and Weir 2014). Such analyses are used to predict the
number of retrieved loci given the choice of restriction enzyme or
based on alternative methods. Even though in many cases there are
no reference genomes available, genome-wide surveys of frequencies
of restriction enzyme recognition sequences show a high variability
across eukaryotic taxonomic groups (Herrera et al. 2015). The frequency of these cleavage sites
seems to be similar among closely related species, which helps to
choose enzymes for RADseq experiments with organisms lacking a
reference genome. Moreover, as RADseq methods differ in costs and
hands-on time in the lab, these factors further influence the
numbers of samples which can be analysed. Pooling samples without
using individually barcoded adaptors is a cost-efficient
alternative, but may prohibit some downstream population genetic
analyses (Futschik and Schlötterer 2010; Andrews et al. 2014).
Advantages and disadvantages of
different RADseq methods have been discussed in detail (Puritz et
al. 2014; Andrews et al.
2014; Andrews et al. 2016). Several biases due to methodological
artefacts may influence the analysis of RADseq data in general. A
common problem is the introduction of PCR duplicates. These
duplicates do not represent independent samples from the analysed
genomic DNA pool. As independence of samples is an underlying
assumption of most population genetic analyses, this may result in
skewed allele frequencies, genotyping errors or false-positive
alleles (Andrews et al. 2014).
Putative PCR duplicates can be identified when using RADseq methods
that include a random-shearing step, as in the original RADseq
protocol (see above). By analysing paired-end sequence reads, PCR
duplicates can be identified as fragments that are identical across
forward and reverse reads (Davey et al. 2011). Additional sources of bias introduced
during PCR are preferential amplification of loci based on GC
content and fragment size, which may impact the variance of
sequence read coverage across loci (Puritz et al. 2014). Critical for all RADseq methods are
problems due to non-random sampling leading to systematic
underestimation of polymorphisms (Arnold et al. 2013; Huang and Knowles 2014). Non-random sampling results from
polymorphic recognition sequences of the restriction enzymes used,
resulting in missing data for some chromosomes/individuals (allelic
dropout).
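
The identification of putative PCR duplicates from paired-end data can be sketched as follows. This minimal example treats read pairs as duplicates only if both the forward and the reverse read are exactly identical. Real tools usually compare mapping positions and tolerate sequencing errors, so this is an illustration of the principle with hypothetical names and toy reads.

```python
def remove_pcr_duplicates(read_pairs):
    """Keep the first occurrence of each (forward, reverse) read pair."""
    seen = set()
    unique = []
    for forward, reverse in read_pairs:
        key = (forward, reverse)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

pairs = [
    ("AATTCGGA", "TTGACCAT"),
    ("AATTCGGA", "TTGACCAT"),  # identical in both reads: putative PCR duplicate
    ("AATTCGGA", "CCATGGTA"),  # same cut site, different sheared end: independent fragment
]
print(remove_pcr_duplicates(pairs))
```

The third pair shows why a random-shearing step is required: without it, all fragments of a locus end at the same positions and duplicates cannot be told apart from independent molecules.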
4.3 Hybrid Enrichment
Hybrid enrichment methods are used for
the specific capture and enrichment of selected sequences (Lemmon
and Lemmon 2013). In short,
capture probes (DNA or RNA) that are complementary to targeted
regions in the genome are hybridized to a DNA library, and target
DNA is enriched by washing away nontargeted DNA prior to
high-throughput sequencing. This method has been used to enrich
selected single-copy orthologous loci for phylogenetic analyses, as
in anchored hybrid enrichment (AHE) (Lemmon et al. 2012) or enrichment of ultraconserved elements
(UCE) (Faircloth et al. 2012).
Moreover, it is widely used for the enrichment of exonic DNA (Li et
al. 2013) or organelle DNA (Briggs
et al. 2009). Prior to the
enrichment, long oligonucleotides (usually ∼60–120 bp) which cover
the target regions have to be designed and synthesized. For this
purpose, genomic or transcriptomic resources of the target species
or closely related species are used as a reference. In the case of
AHE, and when targeting UCEs, it has been shown that capture probes
could even be successfully designed for vertebrates across multiple
evolutionary timescales, in some cases spanning divergence times of
~500 million years (Lemmon et al. 2012; Faircloth et al. 2012). Capture probes can be designed for
several hundred to thousands of loci in parallel, which may involve
several thousand oligonucleotides. Most target enrichment
applications follow a solution-based enrichment protocol (sometimes
with modifications) as developed by Gnirke et al. (2009) (◘ Fig. 4.7). Designed oligonucleotides are synthesized
on a microarray (Lipshutz et al. 1999), cleaved and eluted. After initial PCR, a
T7 promoter sequence is added to the double-stranded DNA. This
promoter can be used to transcribe DNA to RNA with the help of T7
RNA polymerase. This polymerase is promoter specific, transcribing
only double-stranded DNA downstream of a T7 promoter
sequence (Studier and Moffatt 1986). The transcription takes place in the
presence of biotin-UTPs, thereby generating biotinylated
single-stranded RNA capture baits (◘ Fig. 4.7a). Meanwhile, genomic
DNA of the target organism is sheared, end repaired, adaptor
ligated (grey) and PCR amplified (◘ Fig. 4.7b). Capture of targets
will take place in solution. For this purpose, strands of genomic
DNA are separated and hybridized with the prepared biotinylated RNA
baits (◘ Fig. 4.7c). After hybridization, target DNA (and
unbound probes) can be captured using magnetic streptavidin-coated
beads (◘ Fig. 4.7d). Unbound DNA is washed away, whereas
captured and thereby enriched target DNA is eluted, PCR amplified
and ready to be sequenced using NGS platforms (◘ Fig. 4.7e).
Fig. 4.7 Principle of solution hybrid selection.
Colours represent
differently targeted DNA regions. Black diamonds represent biotin label.
a Long oligonucleotides are
synthesized on a microarray, cleaved and eluted. After initial PCR,
a T7 promoter is added to double-stranded DNA. In the presence of
biotin-UTP, biotinylated single-stranded RNA baits are generated
(milky lines with black
diamonds). b Genomic
DNA of the target organism is sheared, end repaired, adaptor
ligated (grey) and PCR
amplified. c Strands of
genomic DNA are separated and hybridized in solution with
biotinylated RNA baits. d
Free biotinylated RNA baits and those hybridizing to target DNA are
captured using streptavidin-coated magnetic beads. e Captured DNA fragments are eluted and
amplified by PCR
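
Designing oligonucleotides that cover a target region amounts to tiling the reference sequence with overlapping windows. The sketch below is a hedged illustration only: the 120 bp bait length lies within the range given above, but the 60 bp step (each base covered by roughly two baits) and the function name are assumptions, and real designs also filter baits for repeats and GC content.

```python
def tile_baits(target, bait_length=120, step=60):
    """Overlapping baits covering the whole target, including its last bases."""
    if len(target) <= bait_length:
        return [target]
    last_start = len(target) - bait_length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # extra bait so the 3' end is covered
    return [target[s:s + bait_length] for s in starts]

# Toy target of 310 bp built from a repeated pattern
target = ("ACGTTGCA" * 40)[:310]
baits = tile_baits(target)
print(len(baits), {len(b) for b in baits})
```

Several hundred to thousands of loci tiled this way quickly add up to the several thousand oligonucleotides mentioned above.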
Two approaches in particular became
widely used for phylogenomic studies. Anchored hybrid enrichment as
introduced by Lemmon et al. (2012)
identifies conserved DNA regions flanked by less conserved regions
for probe design. Usually alignments of genomically
well-characterized model species are exploited to design
oligonucleotides. AHE has been mostly used for phylogenetic
analyses of different groups of vertebrates (Prum et al.
2015; Eytan et al. 2015; Ruane et al. 2015). Faircloth et al. (2012) targeted UCEs, which were initially
described as perfectly conserved segments of mammalian genomes
which are not functionally transcribed (Dermitzakis et al.
2005). Such regions have also been
described in other animals, as well as in plants and fungi (Siepel et
al. 2005; Zheng and Zhang
2008). Using UCEs has the
advantage that a set of loci can be characterized in highly
divergent reference genomes and later applied to a diverse set of
taxa, without the need of always designing new probes (Jones and
Good 2016). As UCEs are often
flanked by variable regions, this method also works across shallow
evolutionary timescales as, for example, demonstrated in the
phylogenetic analysis of a cichlid radiation (McGee et al.
2016).
Hybridization enrichment strategies
have also been used successfully when working with ancient DNA.
Often only a very low level of endogenous DNA is preserved in
ancient specimens (1–2%), while the majority represents
environmental DNA (Carpenter et al. 2013). Moreover, the DNA is normally highly
degraded, and only short and also damaged fragments are present.
Consequently, wgs approaches might not be effective and might be too costly
when dealing with ancient DNA. Fu et al. (2013) developed capture probes targeting the
complete mitochondrial genome and representative portions of the
nuclear genome in ancient humans. It was furthermore possible to
sequence complete mitochondrial genomes from the oldest ancient
humans investigated so far (> ~300,000 years ago) (Meyer et al.
2014). This method has also been
demonstrated to work with highly degraded and ultrashort DNA in
non-permafrost-preserved cave bears from the Middle Pleistocene
(Dabney et al. 2013). Target
capture of mitochondrial genomes in permafrost-preserved horse
fossils even allowed the analysis of specimens dated to 560,000
to 780,000 years ago (Orlando et al. 2013).
As an alternative to in-solution
hybridization methods, capture can take place directly on a
microarray (Albert et al. 2007).
DNA microarrays were initially used to study gene expression
patterns (Schena et al. 1995), an
application which is now increasingly supplanted by RNA-Seq (see
below). DNA microarrays are a collection of DNA sequences which are
attached to a surface (e.g. glass). Specific PCR products or
designed oligonucleotides can be printed at specified sites on
glass slides using high-precision arraying robots (Schulze and
Downward 2001). Complementary DNA
can be directly hybridized to DNA microarrays and thereby captured.
If this DNA is fluorescently labelled, the intensity of bound DNA
can be measured, e.g. to infer the relative expression of mRNA. In
the case of hybridization enrichment, genomic DNA is sheared,
adaptor ligated, amplified and hybridized with the array (Albert et
al. 2007). Non-hybridized DNA is
washed away, while the captured (and thereby enriched) DNA
fragments are eluted and prepared for subsequent NGS library
preparation. Liu et al. (2016)
demonstrated the successful enrichment of mitochondrial genomes of
insects using such a microarray capture approach.
4.4 Expressed Sequence Tags and RNA-Seq
The transcriptome comprises the
complete set of transcripts, as well as their quantity, of a cell
or population of cells. Several technologies are available to
sequence and quantify the transcriptome, including
hybridization-based approaches using microarrays (see above) or
direct sequencing (Wang et al. 2009). Using Sanger-based techniques, sequencing
of expressed sequence tags (ESTs) was established in the 1990s to
characterize transcriptomes (Adams et al. 1991), even though the lack of sequencing power
usually did not allow the quantification of gene expression. By
harnessing the power of NGS techniques, RNA-Seq became the method
of choice to sequence transcriptomes and to determine gene
expression levels. In general, for both methods RNA is reverse
transcribed to a library of cDNA fragments. The RNA can be total,
selected for transcripts carrying a poly-A-tail or depleted in
ribosomal RNA. Similarly, specific libraries targeting small RNAs
(e.g. tRNAs, microRNAs) can be constructed. For EST sequencing,
cDNA is cloned into an appropriate vector, which is sequenced from
both ends. Alternatively, directional cloning of cDNA is possible,
so that only 5′-ends of the sequences are sequenced, thereby
avoiding poly-A-tail sequences. Sequencing takes place with the
Sanger technique, and usually a few hundred to a few
thousand transcript ends are manageable. This method played an
important role in gene discovery (Schuler 1997) and also paved the way for the first
broadscale phylogenomic studies in animals (Dunn et al.
2008). With dbEST, an entire
database hosted by NCBI GenBank is dedicated to EST sequences
(Boguski et al. 1993).
Transcriptome sequencing by RNA-Seq
exploits available NGS high-throughput technologies (Wang et al.
2009). As for EST sequencing, RNA
is first converted to a cDNA library. The cDNA fragments will
then be prepared for NGS methods by attaching adaptors to both
ends. The library is finally sequenced in a high-throughput manner
to obtain a high coverage of short sequence reads. RNA-Seq can be
used for transcriptome assembly, as well as expression profiling at
the same time. Especially for non-model organisms, RNA-Seq became
the method of choice for de novo transcriptome assembly, gene
discovery and gene expression comparisons (Ekblom and Galindo
2011; McCormack et al.
2013; Todd et al. 2016). By using RNA-Seq, hundreds to thousands
of putatively orthologous genes can be discovered, and thereby
transcriptome-based phylogenomic analyses became state of the art
to understand animal evolution (Telford et al. 2015; Dunn et al. 2014). Moreover, RNA-Seq is a powerful tool for
gene expression analyses. The expression level of genes is measured
by the number of sequenced fragments that map back to each
transcript. For RNA-Seq, abundance levels are commonly given as reads
per kilobase of transcript per million mapped reads (RPKM) (Mortazavi et al. 2008). Compared to microarray studies, the
RNA-Seq approach offers several advantages (◘ Table 4.1), e.g. identification
of gene isoforms and allele-specific expression, nucleotide
polymorphisms and post-transcriptional base modifications (Malone
and Oliver 2011; Rapaport et al.
2013). Importantly, this approach
also enabled comparative gene expression studies for organisms
where reference genomes or transcriptomes are missing (Todd et al.
2016). Consequently, RNA-Seq
became a powerful approach to study differential gene expression,
which aims to investigate qualitative and quantitative differences
of genes expressed in different cell types (Gilbert 2013).
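To make the normalization concrete, the following minimal Python sketch computes RPKM from raw read counts, assuming the common definition of reads per kilobase of transcript per million mapped reads. The function name `rpkm`, the gene names and all numbers are hypothetical and chosen only for illustration; they are not taken from Mortazavi et al. (2008).

```python
def rpkm(read_counts, lengths_bp):
    """Return RPKM per transcript.

    RPKM = reads / (transcript length in kb * total mapped reads in millions)
    """
    total_mapped = sum(read_counts.values())
    if total_mapped == 0:
        raise ValueError("no mapped reads")
    millions = total_mapped / 1_000_000
    result = {}
    for name, count in read_counts.items():
        kilobases = lengths_bp[name] / 1_000
        result[name] = count / (kilobases * millions)
    return result


# Hypothetical toy data: mapped read counts and transcript lengths (bp)
counts = {"geneA": 500, "geneB": 1_000, "geneC": 998_500}
lengths = {"geneA": 1_000, "geneB": 2_000, "geneC": 4_000}

values = rpkm(counts, lengths)
for gene, value in values.items():
    print(gene, round(value, 1))
```

In this toy data set geneA and geneB receive the same RPKM although geneB has twice as many reads, because geneB is also twice as long.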
Table 4.1 Comparison of different methods investigating gene expression (partly adopted from Wang et al. (2009))

| | Microarray | ESTs | RNA-Seq |
|---|---|---|---|
| Principle | Hybridization | Sanger | NGS (e.g. Illumina) |
| Resolution | Several to 100 bp | Single base pair | Single base pair |
| Throughput | High | Low | High |
| Prior genomic resources | Required | Not required | Not required |
| Isoform distinction | No | Yes | Yes |
| Allelic expression | No | Yes | Yes |

As powerful and straightforward as the
counting of mapped reads appears, several pitfalls have to be
avoided when working with RNA-Seq data (Tarazona et al.
2011; Vijay et al. 2013). The expression signal of any given
transcript is obviously limited by the sequencing depth and is
thereby also dependent on the level of expression of other
transcripts (Rapaport et al. 2013). Additionally, there is a transcript
length bias, as more reads map to long transcripts compared to
short transcripts of similar expression (Oshlack and Wakefield
2009). As a result, the probability of
detecting the presence as well as the differential expression of a given
transcript varies strongly. Biological variance in gene expression
due to genetic or environmental differences can further complicate
RNA-Seq analyses (Todd et al. 2016). And, finally, bias can be introduced by
technical differences when comparing different sequencing runs (or
even lanes of a single flow cell) or different library preparations
(McIntyre et al. 2011). To deal
with these problems, gene expression experiments should be designed
carefully. For example, increased sequencing depth may help to
uncover lowly expressed variants and alleviate problems related to
transcript length, but at the same time it also increases the number
of false positives due to sequencing errors. As a rule of thumb,
the larger the genome of the analysed species, the more complex is
its transcriptome. For «simple» yeast transcriptomes, it was shown
that with 30 million short (35 bp) reads the expression of >90%
of the expected transcripts could be detected (Wang et al.
2009). For the more «complex»
chicken transcriptome, similar numbers (~30 million) of
medium-sized reads (75 bp) were enough to detect 90% of all
annotated genes, and even with 10 million reads, 80% of the genes
could be detected (Wang et al. 2011). After reviewing gene expression studies
across a diverse set of eukaryotes, Todd et al. (2016) suggest that 5
to 20 million mapped reads per sample are usually a sufficient sequencing
depth. There is also a trade-off between the number of biological
replicates to be sequenced and their costs. Such replicates can
improve estimates of variance for different sources of bias and are
obviously necessary to quantify biological variation. It has been
shown that increasing the number of biological replicates has a
stronger positive effect on the statistical power of differential
gene expression experiments than increasing the sequencing depth
for each sample (Liu et al. 2014).
Useful guidelines for the design of RNA-Seq experiments in the
context of evolutionary and ecological research questions are given
by Wolf (2013) and Todd et al.
(2016).
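Two of the pitfalls described above, the transcript length bias and the dependence of each transcript's signal on the expression of all other transcripts, can be illustrated with a small deterministic sketch. It assumes a deliberately simplified model in which reads are drawn in proportion to molecule abundance times transcript length; the function `expected_reads`, the transcript names and all values are hypothetical.

```python
def expected_reads(abundance, lengths_bp, total_reads):
    """Expected read count per transcript under a simplified model:
    reads are sampled in proportion to (molecule abundance * length)."""
    weights = {t: abundance[t] * lengths_bp[t] for t in abundance}
    total_weight = sum(weights.values())
    return {t: total_reads * w / total_weight for t, w in weights.items()}


# Hypothetical transcript lengths in bp
lengths = {"short": 500, "long": 5_000, "other": 2_000}

# Hypothetical molecule counts; only "other" differs between conditions
baseline = {"short": 10, "long": 10, "other": 100}
induced = {"short": 10, "long": 10, "other": 1_000}

base_reads = expected_reads(baseline, lengths, total_reads=1_000_000)
ind_reads = expected_reads(induced, lengths, total_reads=1_000_000)

# Length bias: equal abundance, but "long" is 10x longer than "short"
print(base_reads["long"] / base_reads["short"])

# Composition effect: "short" is unchanged in abundance, yet its read
# count drops when "other" is strongly induced at fixed sequencing depth
print(base_reads["short"], ind_reads["short"])
```

Under this model the long transcript collects ten times as many reads as the short one at identical abundance, and the short transcript loses reads when another transcript is induced, although its own abundance has not changed.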
4.5 Single-Cell Genomics and Transcriptomics
Single-cell genomics and
transcriptomics aim to study genetic diversity on a cellular level
(Tang et al. 2011; Shapiro et al.
2013). Using these approaches it
is possible to study microbial ecosystems and cell lineage
relationships or to connect genotypes with phenotypes on a
single-cell level. However, the acquisition of high-quality
single-cell sequencing data comes with major technical challenges:
(1) physical isolation of individual cells, (2) amplification of
the genome (or transcriptome) of single cells for downstream
analyses and (3) analysing the data given the biases and errors
introduced during the first two steps (Gawad et al. 2016). The isolation of individual cells can be
facilitated by methods like serial dilution, microfluidics,
micromanipulation, laser-capture microdissection or
fluorescence-activated cell sorting (FACS) (Yilmaz and Singh
2012). Single cells have to be
transferred to reaction tubes for subsequent DNA or RNA extraction.
In the case of RNA, reverse transcription into cDNA is necessary.
Currently, amplification of the DNA (or cDNA) of single cells is
required to gain a sufficient amount of molecules for sequencing.
However, in the near future, single-molecule sequencing as performed
by third-generation sequencing platforms (PacBio, Oxford Nanopore)
should make this step unnecessary. It is possible to sequence the
transcriptome and genome of the same cell as demonstrated by
Macaulay et al. (2015).
Single-cell genomics has emerged as a
powerful tool to recover genomic information from uncultured,
individual cells of environmental microorganisms (Stepanauskas
2012). As this method recovers all
genomic information of a given cell, chromosomal and
extrachromosomal elements are recovered, thereby also detecting
possible infections by viruses. For example, Labonte et al.
(2015) demonstrated that host-virus relationships can be
investigated in marine
microbial communities. Furthermore, single-cell genomics helps to
link the genotype of as yet unculturable prokaryotes with metabolic
functions as derived from annotation of their genomes. For example,
the investigation of ubiquitous but uncultured Proteobacteria
lineages sampled in the dark oxygenated ocean revealed potential
chemolithoautotrophy, thereby providing a new perspective on carbon
cycling of this large oceanic habitat (Swan et al. 2011).
Single-cell transcriptomic approaches
have been successfully implemented for evolutionary developmental
research. Lee et al. (2014)
developed fluorescent in situ RNA sequencing (FISSEQ), a method
where cDNA is directly sequenced within biological samples (tissue
sections, whole-mount embryos). Alternatively, Achim et al.
(2015) proposed to compare
transcriptomes from single-cell sequencing (of cells with unknown
spatial locations) with available expression profiles from a gene
expression atlas. Using this method, >80% of cells could be
allocated to precise locations in the brain of the model annelid
Platynereis dumerilii.
Ultimately, these methods will help to resolve the origin, features
and fate of different cell types in complex tissues (Satija et al.
2015).
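The idea of assigning a sequenced cell to a position in an expression atlas can be sketched as follows. This is a simplified stand-in and not the scoring scheme of Achim et al. (2015): it assumes binarized (on/off) expression states and simply counts the fraction of genes that agree. The functions `match_score` and `best_position`, the gene names and the atlas positions are hypothetical.

```python
def match_score(cell_profile, atlas_profile):
    """Fraction of shared genes whose on/off state agrees."""
    genes = cell_profile.keys() & atlas_profile.keys()
    if not genes:
        raise ValueError("no shared genes")
    agree = sum(cell_profile[g] == atlas_profile[g] for g in genes)
    return agree / len(genes)


def best_position(cell_profile, atlas):
    """Return the best-matching atlas position and all scores."""
    scores = {pos: match_score(cell_profile, prof)
              for pos, prof in atlas.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical atlas: on/off expression of five marker genes per position
atlas = {
    "pos_1": {"g1": 1, "g2": 0, "g3": 0, "g4": 1, "g5": 1},
    "pos_2": {"g1": 0, "g2": 1, "g3": 1, "g4": 0, "g5": 0},
    "pos_3": {"g1": 1, "g2": 1, "g3": 0, "g4": 0, "g5": 1},
}

# Hypothetical single cell of unknown location; g5 is not detected,
# as might happen through dropout in single-cell data
cell = {"g1": 1, "g2": 0, "g3": 0, "g4": 1, "g5": 0}

position, scores = best_position(cell, atlas)
print(position, scores)
```

Even with one undetected marker gene, the cell still matches pos_1 best in this toy example. A real analysis would need to weight genes by their specificity and handle ties and noise explicitly.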
References
Anderson S (1981) Shotgun DNA
sequencing using cloned DNase I-generated fragments. Nucleic Acids
Res 9:3015–3027
Andrews KR, Good JM, Miller
MR, Luikart G, Hohenlohe PA (2016) Harnessing the power of RADseq
for ecological and evolutionary genomics. Nat Rev Genet
17:81–92
Baird NA, Etter PD, Atwood
TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson
EA (2008) Rapid SNP discovery and genetic mapping using sequenced
RAD markers. PLoS One 3:e3376
Boguski MS, Lowe TMJ,
Tolstoshev CM (1993) dbEST – database for «expressed sequence
tags». Nat Genet 4:332–333
Briggs AW, Good JM, Green
RE, Krause J, Maricic T, Stenzel U, Lalueza-Fox C, Rudan P,
Brajković D, Kućan Ž, Gušić I, Schmitz R, Doronichev VB, Golovanova
LV, de la Rasilla M, Fortea J, Rosas A, Pääbo S (2009) Targeted
retrieval and analysis of five Neandertal mtDNA genomes. Science
325:318–321
Carpenter ML, Buenrostro JD,
Valdiosera C, Schroeder H, Allentoft Morten E, Sikora M, Rasmussen
M, Gravel S, Guillén S, Nekhrizov G, Leshtakov K, Dimitrova D,
Theodossiev N, Pettener D, Luiselli D, Sandoval K, Moreno-Estrada
A, Li Y, Wang J, Gilbert MTP, Willerslev E, Greenleaf WJ,
Bustamante CD (2013) Pulling out the 1%: whole-genome capture for
the targeted enrichment of ancient DNA sequencing libraries. Am J
Hum Genet 93:852–864
Chaisson MJ, Brinza D,
Pevzner PA (2009) De novo fragment assembly with short mate-paired
reads: does the read length matter? Genome Res
19:336–346
Dabney J, Knapp M, Glocke I,
Gansauge M-T, Weihmann A, Nickel B, Valdiosera C, García N, Pääbo
S, Arsuaga J-L, Meyer M (2013) Complete mitochondrial genome
sequence of a Middle Pleistocene cave bear reconstructed from
ultrashort DNA fragments. Proc Natl Acad Sci U S A
110:15758–15763
Deschamps S, Llaca V, May GD
(2012) Genotyping-by-sequencing in plants. Biology
1:460
Dunn CW, Hejnol A, Matus DQ,
Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe
GD, Sorensen MV, Haddock SHD, Schmidt-Rhaesa A, Okusu A, Kristensen
RM, Wheeler WC, Martindale MQ, Giribet G (2008) Broad phylogenomic
sampling improves resolution of the animal tree of life. Nature
452:745–750
Dunn CW, Giribet G,
Edgecombe GD, Hejnol A (2014) Animal phylogeny and its evolutionary
implications. Annu Rev Ecol Syst 45:371–395
Elshire RJ, Glaubitz JC, Sun
Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE (2011) A robust,
simple genotyping-by-sequencing (GBS) approach for high diversity
species. PLoS One 6:e19379
Etter PD, Preston JL,
Bassham S, Cresko WA, Johnson EA (2011) Local de novo assembly of
RAD paired-end contigs using short sequencing reads. PLoS One
6:e18561
Eytan RI, Evans BR, Dornburg
A, Lemmon AR, Lemmon EM, Wainwright PC, Near TJ (2015) Are 100
enough? Inferring acanthomorph teleost phylogeny using anchored
hybrid enrichment. BMC Evol Biol 15:113
Friis G, Aleixandre P,
Rodríguez-Estrella R, Navarro-Sigüenza AG, Milá B (2016) Rapid
postglacial diversification and long-term stasis within the
songbird genus Junco: phylogeographic and phylogenomic evidence.
Mol Ecol 24:6175–6195
Fu Q, Meyer M, Gao X,
Stenzel U, Burbano HA, Kelso J, Pääbo S (2013) DNA analysis of an
early modern human from Tianyuan Cave, China. Proc Natl Acad Sci U
S A 110:2223–2227
Futschik A, Schlötterer C
(2010) The next generation of molecular markers from massively
parallel sequencing of pooled DNA samples. Genetics
186:207–218
Gardner RC, Howarth AJ, Hahn
P, Brown-Luedi M, Shepherd RJ, Messing J (1981) The complete
nucleotide sequence of an infectious clone of cauliflower mosaic
virus by M13mp7 shotgun sequencing. Nucleic Acids Res
9:2871–2888
Gilbert S (2013)
Developmental biology, 10th edn. Sinauer Associates Inc.,
Sunderland
Gnirke A, Melnikov A,
Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos
G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C
(2009) Solution hybrid selection with ultra-long oligonucleotides
for massively parallel targeted sequencing. Nat Biotechnol
27:182–189
Gonen S, Lowe NR, Cezard T,
Gharbi K, Bishop SC, Houston RD (2014) Linkage maps of the Atlantic
salmon (Salmo salar) genome
derived from RAD sequencing. BMC Genomics 15:1–17
Herrera S, Reyes-Herrera PH,
Shank TM (2015) Predicting RAD-seq marker numbers across the
eukaryotic tree of life. Genome Biol Evol 7:3207–3225
Houston RD, Davey JW, Bishop
SC, Lowe NR, Mota-Velasco JC, Hamilton A, Guy DR, Tinch AE, Thomson
ML, Blaxter ML, Gharbi K, Bron JE, Taggart JB (2012)
Characterisation of QTL-linked and genome-wide restriction
site-associated DNA (RAD) markers in farmed Atlantic salmon. BMC
Genomics 13:244
International Human Genome
Sequencing Consortium (2001) Initial sequencing and analysis of the
human genome. Nature 409:860–921
Labonte JM, Swan BK, Poulos
B, Luo H, Koren S, Hallam SJ, Sullivan MB, Woyke T, Eric Wommack K,
Stepanauskas R (2015) Single-cell genomics-based analysis of
virus-host interactions in marine surface bacterioplankton. ISME J
9:2386–2399
Lee JH, Daugharthy ER,
Scheiman J, Kalhor R, Yang JL, Ferrante TC, Terry R, Jeanty SSF, Li
C, Amamoto R, Peters DT, Turczyk BM, Marblestone AH, Inverso SA,
Bernard A, Mali P, Rios X, Aach J, Church GM (2014) Highly
multiplexed subcellular RNA sequencing in situ. Science
343:1360–1363
Lee H, Gurtowski J, Yoo S,
Nattestad M, Marcus S, Goodwin S, McCombie W, Schatz M (2016)
Third-generation sequencing and the future of genomics. BioRxiv.
http://dx.doi.org/10.1101/048603
Lemmon EM, Lemmon AR (2013)
High-throughput genomic data in systematics and phylogenetics. Annu
Rev Ecol Syst 44:99–121
Lieberman-Aiden E, van
Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I,
Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender
MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander
ES, Dekker J (2009) Comprehensive mapping of long-range
interactions reveals folding principles of the human genome.
Science 326:289–293
Macaulay IC, Haerty W, Kumar
P, Li YI, Hu TX, Teng MJ, Goolam M, Saurat N, Coupland P, Shirley
LM, Smith M, Van der Aa N, Banerjee R, Ellis PD, Quail MA, Swerdlow
HP, Zernicka-Goetz M, Livesey FJ, Ponting CP, Voet T (2015)
G&T-seq: parallel sequencing of single-cell genomes and
transcriptomes. Nat Methods 12:519–522
Mahairas GG, Wallace JC,
Smith K, Swartzell S, Holzman T, Keller A, Shaker R, Furlong J,
Young J, Zhao S, Adams MD, Hood L (1999) Sequence-tagged
connectors: a sequence approach to mapping and scanning the human
genome. Proc Natl Acad Sci U S A 96:9739–9744
Malone JH, Oliver B (2011)
Microarrays, deep sequencing and the true measure of the
transcriptome. BMC Biol 9:34
Marra MA, Kucaba TA,
Dietrich NL, Green ED, Brownstein B, Wilson RK, McDonald KM,
Hillier LW, McPherson JD, Waterston RH (1997) High throughput
fingerprint analysis of large-insert clones. Genome Res
7:1072–1084
McGee MD, Faircloth BC,
Borstein SR, Zheng J, Darrin Hulsey C, Wainwright PC, Alfaro ME
(2016) Replicated divergence in cichlid radiations mirrors a major
vertebrate innovation. Proc R Soc Lond B Biol Sci
283:20151413
McIntyre LM, Lopiano KK,
Morse AM, Amin V, Oberg AL, Young LJ, Nuzhdin SV (2011) RNA-seq:
technical variability and sampling. BMC Genomics 12:293
Mendelowitz L, Pop M (2014)
Computational methods for optical mapping. Gigascience
3:33
Orlando L, Ginolhac A, Zhang
G, Froese D, Albrechtsen A, Stiller M, Schubert M, Cappellini E,
Petersen B, Moltke I, Johnson PLF, Fumagalli M, Vilstrup JT,
Raghavan M, Korneliussen T, Malaspinas A-S, Vogt J, Szklarczyk D,
Kelstrup CD, Vinther J, Dolocan A, Stenderup J, Velazquez AMV,
Cahill J, Rasmussen M, Wang X, Min J, Zazula GD, Seguin-Orlando A,
Mortensen C, Magnussen K, Thompson JF, Weinstock J, Gregersen K,
Roed KH, Eisenmann V, Rubin CJ, Miller DC, Antczak DF, Bertelsen
MF, Brunak S, Al-Rasheid KAS, Ryder O, Andersson L, Mundy J, Krogh
A, Gilbert MTP, Kjaer K, Sicheritz-Ponten T, Jensen LJ, Olsen JV,
Hofreiter M, Nielsen R, Shapiro B, Wang J, Willerslev E (2013)
Recalibrating Equus
evolution using the genome sequence of an early Middle Pleistocene
horse. Nature 499:74–78
Oshlack A, Wakefield MJ
(2009) Transcript length bias in RNA-seq data confounds systems
biology. Biol Direct 4:14
Peterson BK, Weber JN, Kay
EH, Fisher HS, Hoekstra HE (2012) Double Digest RADseq: an
inexpensive method for de novo SNP discovery and genotyping in
model and non-model species. PLoS One 7:e37135
Poland JA, Brown PJ,
Sorrells ME, Jannink J-L (2012) Development of high-density genetic
maps for barley and wheat using a novel two-enzyme
genotyping-by-sequencing approach. PLoS One 7:e32253
Putnam NH, O’Connell B,
Stites JC, Rice BJ, Fields A, Hartley PD, Sugnet CW, Haussler D,
Rokhsar DS, Green RE (2016) Chromosome-scale shotgun assembly using
an in vitro method for long-range linkage. Genome Res
26:342–350
Rapaport F, Khanin R, Liang
Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D (2013)
Comprehensive evaluation of differential gene expression analysis
methods for RNA-seq data. Genome Biol 14:1–13
Ruane S, Raxworthy CJ,
Lemmon AR, Lemmon EM, Burbrink FT (2015) Comparing species tree
estimation with large anchored phylogenomic and small
Sanger-sequenced molecular datasets: an empirical study on Malagasy
pseudoxyrhophiine snakes. BMC Evol Biol 15:1–14
Satija R, Farrell JA,
Gennert D, Schier AF, Regev A (2015) Spatial reconstruction of
single-cell gene expression data. Nat Biotechnol
33:495–502
Schwarz A, Cabezas-Cruz A,
Kopecky J, Valdes JJ (2014) Understanding the evolutionary
structural variability and target specificity of tick salivary
Kunitz peptides using next generation transcriptome data. BMC Evol
Biol 14
Shizuya H, Birren B, Kim UJ,
Mancino V, Slepak T, Tachiiri Y, Simon M (1992) Cloning and stable
maintenance of 300-kilobase-pair fragments of human DNA in
Escherichia coli using an
F-factor-based vector. Proc Natl Acad Sci U S A
89:8794–8797
Siepel A, Bejerano G,
Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J,
Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ,
Miller W, Haussler D (2005) Evolutionarily conserved elements in
vertebrate, insect, worm, and yeast genomes. Genome Res
15:1034–1050
Soderlund C, Longden I, Mott
R (1997) FPC: a system for building contigs from restriction
fingerprinted clones. Comput Appl Biosci CABIOS
13:523–535
Tang F, Lao K, Surani MA
(2011) Development and applications of single-cell transcriptome
analysis. Nat Methods 8:S6–11
Tarazona S, García-Alcalde
F, Dopazo J, Ferrer A, Conesa A (2011) Differential expression in
RNA-seq: a matter of depth. Genome Res 21:2213–2223
The Arabidopsis Genome
Initiative (2000) Analysis of the genome sequence of the flowering
plant Arabidopsis thaliana.
Nature 408:796–815
The C. elegans Sequencing
Consortium (1998) Genome sequence of the nematode C. elegans: a platform for
investigating biology. Science 282:2012–2018
Toonen RJ, Puritz JB,
Forsman ZH, Whitney JL, Fernandez-Silva I, Andrews KR, Bird CE
(2013) ezRAD: a simplified method for genomic genotyping in
non-model organisms. Peer J 1:e203
van Heesch S, Kloosterman
WP, Lansu N, Ruzius F-P, Levandowsky E, Lee CC, Zhou S, Goldstein
S, Schwartz DC, Harkins TT, Guryev V, Cuppen E (2013) Improving
mammalian genome scaffolding using large insert mate-pair
next-generation sequencing. BMC Genomics 14:257
Venter JC, Adams MD, Myers
EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt
RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR,
Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G,
Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG,
Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M,
Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D,
Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S,
Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E,
Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I,
Charlab R, Chaturvedi K, Deng Z, Francesco VD, Dunn P, Eilbeck K,
Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P,
Heiman TJ, Higgins ME, Ji R-R, Ke Z, Ketchum KA, Lai Z, Lei Y, Li
Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM,
Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S,
Shao W, Shue B, Sun J, Wang ZY, Wang A, Wang X, Wang J, Wei M-H,
Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao
Q, Zheng L, Zhong F, Zhong W, Zhu SC, Zhao S, Gilbert D, Baumhueter
S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A,
Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D,
Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L,
Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N,
Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S,
Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline
L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T,
McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts
E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers Y-H, Romblad
D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R,
Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams
S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K,
Abril JF, Guigó R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal
A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan
A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz
B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M,
Carnes-Stine J, Caulk P, Chiang Y-H, Coyne M, Dahlke C, Mays AD,
Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H,
Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B,
Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C,
Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X,
Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T,
Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R,
Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R,
Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X (2001)
The sequence of the human genome. Science 291:1304–1351
Wang Z, Gerstein M, Snyder M
(2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev
Genet 10:57–63
Wang Y, Ghaffari N, Johnson
CD, Braga-Neto UM, Wang H, Chen R, Zhou H (2011) Evaluation of the
coverage and depth of transcriptome by RNA-Seq in chickens. BMC
Bioinform 12(Suppl. 10):S5