| Abstract |
Heterogeneous
samples plague genome assemblers because polymorphisms obscure the
short overlaps between reads that assemblers rely on to stitch the
genome together. As a
result, current metagenomic and population sequencing projects are
nearly impossible to assemble de novo. Recently, seven
different strains of Drosophila simulans were sequenced at low
coverage to study population variation—six at 1x coverage, and one
at 3x coverage [1]. De novo assembly of
this diverse population proved difficult, and the average D.
simulans contig was one-seventh the size of those for the other
Drosophila species sequenced at similar coverage [2].
Comparative genome
assembly (or syntenic assembly) provides a solution to the problems
posed by heterogeneous sequencing samples. In comparative assembly,
the traditional overlapping step is bypassed, and the layout of reads
is inferred instead from an alignment to a reference genome. This
approach is more robust to polymorphisms because it relies on
whole-read alignments to the reference rather than short overlaps
between reads. The longer alignments are more tolerant of errors,
easier to compute, and easier to place correctly along the genome.
These benefits are critical for low-coverage and heterogeneous
sequencing projects, in which high-quality read overlaps are sparse.
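As a minimal illustration of the layout step described above, and not
AMOScmp's actual implementation, the following Python sketch groups reads
into contigs using only their alignment coordinates on the reference. The
alignment tuples, the function name layout_contigs, and the half-open
coordinate convention are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical alignment record: (read_id, ref_start, ref_end), half-open.
Alignment = Tuple[str, int, int]


def layout_contigs(alignments: List[Alignment]) -> List[List[str]]:
    """Infer a read layout from reference coordinates alone.

    Reads are sorted by their position on the reference; consecutive reads
    whose reference intervals overlap are placed in the same contig, and a
    gap in reference coverage starts a new contig. No read-to-read overlap
    is ever computed.
    """
    contigs: List[List[str]] = []
    current: List[str] = []
    current_end = -1  # rightmost reference position covered by `current`

    for read_id, start, end in sorted(alignments, key=lambda a: (a[1], a[2])):
        if current and start >= current_end:
            # Coverage gap on the reference: close the current contig.
            contigs.append(current)
            current = []
            current_end = -1
        current.append(read_id)
        current_end = max(current_end, end)

    if current:
        contigs.append(current)
    return contigs


if __name__ == "__main__":
    reads = [("r3", 250, 350), ("r1", 0, 100), ("r2", 80, 180), ("r4", 300, 400)]
    print(layout_contigs(reads))
```

Because each read is placed independently by its alignment, a polymorphism
inside the region shared by two reads does not prevent them from joining
the same contig, provided both still align to the reference.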
Using the closely
related D. melanogaster as a reference, we aggressively
co-assembled the seven sequenced strains of D. simulans using
our comparative assembly program, AMOScmp [3].
The AMOScmp contigs were then passed to Celera Assembler, which assembled
and scaffolded them alongside the reads that failed to match the reference
genome. Our improved co-assembly increases depth of coverage
threefold over the original assembly and contains thousands of
additional genes. These results show that comparative assembly is a
promising means of assembling diverse population samples and
outperforms traditional assembly in both quality and the number of
genes recovered. In addition, when combined with
overlap-based assembly, comparative assembly can succeed even when
the reference genome comes from a different species.
|