IMR: Iterative reads-Mapping and Re-assembly
IMR is a programme that iteratively assembles the short reads generated by Illumina sequencers. It should handle other short-read data too, but has not been tested on other platforms at this time. It was designed to reassemble homozygous genomes, e.g. inbred strains or haploid organisms, where a reference genome is available that is sufficiently similar to the genome of the sequenced sample that most reads can be aligned using a standard short-read mapper such as Stampy, MAQ, SMALT or BWA. However, we are extending it to work with heterozygous genomes. The novel aspects of IMR are how, starting from the reference sequence, it iteratively mutates the consensus towards the sequence of the sample, and the algorithms used for variant calling. These are described below.
Iterative realignment has a potential advantage over a single-pass aligner for describing complex loci. Briefly, at each iteration, reads are aligned to the current version of a consensus sequence for the genome, and high-confidence SNPs and indels are called and incorporated into a new consensus. This process is repeated until additional rounds of iteration produce few (or alternating) changes in the consensus sequence.
For assembling Arabidopsis thaliana accessions, we use the TAIR10 reference sequence as the consensus for the first iteration, and then align reads using STAMPY{Lunter, #21}. For other inbred genomes, such as inbred strains of mice or rats, substitute the appropriate reference genome (mm9 or rn3.4). We have found that convergence occurs after about five iterations, when the number of additional variants accepted is less than 2% of the number of variants detected in the first iteration. At that point, the majority of remaining variants are unresolvable “heterozygotes” or cycle between alleles in successive iterations. These ambiguous positions can arise for multiple reasons, including where repetitive read mappings are not resolvable, where there is copy number variation, or where genomes harbour residual heterozygosity.
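The outer loop and the 2% convergence criterion can be sketched as below. This is an illustrative skeleton, not IMR's implementation: `align_and_call` and `apply_variants` are hypothetical callables standing in for the mapping (e.g. with STAMPY) and variant-calling steps described in this section.

```python
def iterate_to_convergence(align_and_call, apply_variants, reference,
                           max_iter=10, threshold=0.02):
    """Sketch of the iterative reassembly loop: realign reads, call variants,
    update the consensus, and stop once the number of newly accepted variants
    falls below `threshold` (2%) of the first iteration's count."""
    consensus = reference
    first_count = None
    iteration = 0
    for iteration in range(1, max_iter + 1):
        variants = align_and_call(consensus)       # map reads, call SNPs/indels
        if first_count is None:
            first_count = max(len(variants), 1)    # baseline from iteration 1
        consensus = apply_variants(consensus, variants)
        if len(variants) < threshold * first_count:
            break                                  # converged
    return consensus, iteration
```

In practice the authors observe convergence after about five iterations; `max_iter` here is only a safeguard against non-terminating alternation.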
Variant calling uses two algorithms:
We call SNPs by producing and processing pileup files in the same way as earlier versions of the SAMTOOLS package did{Li, 2009 #45}, using default parameters (as this feature is no longer present in the current version of SAMTOOLS, IMR implements this step itself). At the end of an iteration, a putative SNP is accepted or rejected using the default criteria in VarFilter{Li, 2009 #45}: it must be separated from other variable sites, its coverage of aligned reads must be ≥3 and <100, and the root mean square (RMS) of the mapping qualities (PHRED scores) of the aligned reads must be >25. SNPs at ill-defined sites, where the reference sequence is not A, C, G or T, are only called if the major allele is supported by at least 80% of reads and coverage is within 20% of the mean genome-wide read coverage.
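The acceptance rules above can be expressed as a single predicate. This is a minimal sketch of the stated thresholds, not IMR's code; in particular, the 10bp minimum separation is an assumed placeholder for VarFilter's proximity criterion, whose exact default is not given in the text.

```python
def accept_snp(ref_base, coverage, rms_mapping_quality, major_allele_fraction,
               mean_genome_coverage, distance_to_nearest_variant,
               min_separation=10):
    """Sketch of the SNP acceptance criteria described above.
    `min_separation` (10bp) is an assumed stand-in for VarFilter's
    separation-from-other-variants rule."""
    # VarFilter-style defaults applied at every site
    if distance_to_nearest_variant < min_separation:
        return False
    if not (3 <= coverage < 100):
        return False
    if rms_mapping_quality <= 25:
        return False
    # Stricter rules at ill-defined sites (reference base not A, C, G or T)
    if ref_base.upper() not in "ACGT":
        if major_allele_fraction < 0.80:
            return False
        if abs(coverage - mean_genome_coverage) > 0.20 * mean_genome_coverage:
            return False
    return True
```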
A confounding factor for describing genetic variation with alignment approaches is clusters of apparently heterozygous SNPs and small indels. These are sometimes an artifactual signature of an indel, because the alignment of the ends of reads spanning a genuine indel can generate SNP calls in preference to indel calls. By allowing sequence variants to be inserted into the consensus, and removing them if they are not supported in further iterations, iterative mapping is potentially useful for resolving such instances. We generate short indel predictions (up to about 30bp) for consensus modification when they are reported in the pileup files and are the best call at a genomic position.
In contrast to short insertions and deletions, we detect long deletions (greater than 10bp) by local assembly. First, sites likely to contain an undetected deletion are inferred from read-pair data, where the local observed insert size is longer than the expected library size (estimated by predicting the insert size of every pair of mapped reads) and where the deviation of the insert size from the mean is identified using dynamic programming{Price, 2005 #46}. We estimate the breakpoints (to within ~30bp) from the difference between the apparent insert size and the global mean value (where multiple libraries with different insert sizes are used, we allow for this by normalizing the insert sizes to a common value). Over this region we build left and right temporary consensus sequences by growing inwards from the breakpoints using the read-mapping information. If there is a deletion, we expect the two ends to overlap. To determine the precise breakpoints, we align the left and right consensus sequences using the Smith-Waterman algorithm{Smith, 1981 #48}.
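The first step, flagging candidate deletion sites from anomalous insert sizes, can be sketched as follows. This simplified version flags positions where the local mean insert size exceeds the expected library size by more than a fixed excess (`min_excess` is an assumed threshold); the dynamic-programming refinement and the local consensus assembly are omitted.

```python
from statistics import mean

def deletion_candidates(insert_sizes_by_position, expected_size, min_excess=50):
    """Sketch of candidate long-deletion detection from read-pair data:
    report positions whose local mean insert size exceeds the expected
    library size by more than `min_excess` bp (an assumed cutoff), along
    with a rough estimate of the deleted length."""
    candidates = []
    for pos, local_sizes in sorted(insert_sizes_by_position.items()):
        excess = mean(local_sizes) - expected_size
        if excess > min_excess:
            # excess insert size approximates the deletion length
            candidates.append((pos, round(excess)))
    return candidates
```

The reported excess corresponds to the apparent-minus-expected insert size from which the method estimates breakpoints to within ~30bp.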
We pay special attention to clusters of SNPs and indels that may be indicative of imbalanced substitutions (i.e., deletion of consensus sequence with simultaneous insertion of a novel sequence of the same or, more typically, a different length). First, the best indel in a cluster is accepted and the nearby variants are left for evaluation in the next iteration. Second, if there are no indels, we accept the SNP in the cluster with the highest RMS mapping quality, provided it passes the criteria described above for SNPs at ill-defined reference sites. We ignore the other nearby variants, allowing them to be considered at the next iteration. We find that SNP and indel predictions that were artefacts of read misalignment often disappear in subsequent iterations.
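The one-variant-per-cluster rule can be sketched as below. The variant record shape (`type`, `rms_mapq`) is an assumption for illustration, as is using RMS mapping quality to rank indels (the text says "best indel" without defining the ranking); the additional ill-defined-site check on the chosen SNP is omitted.

```python
def resolve_cluster(variants):
    """Sketch of resolving a cluster of nearby variant calls: prefer the
    best-supported indel; failing that, take the SNP with the highest RMS
    mapping quality. Remaining variants are deferred to the next iteration."""
    indels = [v for v in variants if v["type"] == "indel"]
    if indels:
        # Assumed ranking: highest RMS mapping quality wins
        return max(indels, key=lambda v: v["rms_mapq"])
    snps = [v for v in variants if v["type"] == "snp"]
    return max(snps, key=lambda v: v["rms_mapq"]) if snps else None
```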
When all iterations are finished, the variants are re-evaluated to identify potential errors caused by read misalignment (i.e., mismapping of repetitive reads). The variants detected in each iteration are mapped to the corresponding coordinate of the reference, allowing an iteration history to be constructed for each variant. Variants that cycle over several iterations but eventually converge are accepted once they become stable, while variants that are still cycling at the end are discarded.
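The final accept-or-discard decision on an iteration history can be sketched as below. `stable_tail` (how many final iterations must agree before a variant counts as stable) is an assumed parameter; the text does not specify the exact stability criterion.

```python
def classify_variant_history(history, stable_tail=2):
    """Sketch of post-hoc filtering by iteration history: `history` is the
    sequence of alleles called for one reference coordinate across iterations.
    If the last `stable_tail` calls agree, accept that allele; if the variant
    is still cycling at the end, discard it (return None)."""
    if len(history) < stable_tail:
        return None
    tail = history[-stable_tail:]
    if all(allele == tail[0] for allele in tail):
        return tail[0]   # converged: accept the stable allele
    return None          # still cycling: discard
```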