IMR/DENOM short-read analysis software

IMR

IMR Installation

IMR requires additional packages to be installed in order to perform read mapping. Please check the IMR/DENOM installation for the default mode.

The following short-read mappers are supported:

stampy (1.0.14+ supported).
bwa
smalt
maq

(Support for other mappers such as Bowtie, soap2 and zoom is planned.)

Make sure the read mapper is installed and available system-wide, or copy the executable file into the external/ subfolder.

Running IMR

IMR uses the project description text file mentioned before. When it is available, the default way to run IMR is:

imr easyrun example.t

To use bwa as the mapper rather than the default stampy:

imr easyrun  -m bwa example.t

To use IMR to align all reads to the reference without iteration, creating a single bam file for visualization or other analysis:

imr easyrun --imrnocall example.t

To use IMR to call variants from an existing bam file, without iterations:

imr imrcall  [options] {ref} {bamfile} [region...]

Other options of imr easyrun:

  --help                  produce help message
  -o [ --outputfile ] arg set the output sdi file
  -e [ --outbam ] arg     output the new bam file
  -f [ --format ] arg     file format used for preprocessing
  --imrnocall             Only map reads and merge bam files, no variant call
  --imrkeepdup            For the merged bam files, do not remove duplicates
  --imrstartfrommap       Start to map raw reads. It can reuse the previous 
                          finished part.
  --imrstartfromcall      Start from variant calling, no mapping or merging
  --mergeall              Merge all reads-group together then deal 
                          with(remove/keep) duplicate
  -m [ --mapper ] arg     the name of program used for mapping: 
                          bwa/maq/stampy/smalt [=stampy]
  --iterations arg        The number of rounds of iterations [=5]
  --iterstartfrom arg     Start Iteration from which round [=1]
  -p [ --threads ] arg    Maximum processors used, can be set in configure file
                          too [=4]
  -q [ --qual ] arg       fastq File format used: sanger,solexa,solexaold,usepr

IMR Parallel Computation Support

The alignment and analysis of next generation sequencing data are time-consuming. Even a common multi-core or multi-processor PC can benefit from IMR's parallel computation support by aligning several lanes simultaneously. Multi-threading is used as follows:

By default, IMR splits any large read file into several smaller files, each containing about 4 million reads (the size can be set using
```
maxreads XX
```
in the project description file), and then aligns them simultaneously. This optimization can be disabled by setting
```
imr easyrun -q nopre
```
on the command line or in the project description file using the parameter
```
prepara -q nopre
```
When multiple files/lanes are used, IMR can align all lanes simultaneously.
Duplicate removal with picard/samtools runs in parallel (by library).
The variant calling algorithms are region-based. By default IMR runs variant calling in parallel (by chromosome).

The degree of parallelism is controlled by the threads variable in the project description file. For example,

 threads 4

will align at most 4 lanes at one time.
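
As a rough illustration of the splitting described above, the number of chunks a read file yields can be estimated from its size. This is only a sketch: it assumes a FASTQ file with exactly four lines per read and a fixed chunk size, and the function name estimate_chunks is hypothetical. IMR's own splitting may differ, since each chunk contains only about maxreads reads.

```shell
# Estimate how many chunks a FASTQ file would be split into.
# Assumptions (not taken from IMR itself): four lines per read and
# chunks of exactly maxreads reads, with the last chunk possibly smaller.
estimate_chunks() {
    fastq=$1
    maxreads=${2:-4000000}   # default mirrors the "about 4 million reads" above
    reads=$(( $(wc -l < "$fastq") / 4 ))
    # Integer ceiling of reads / maxreads.
    echo $(( (reads + maxreads - 1) / maxreads ))
}

# Example call (file name is a placeholder):
#   estimate_chunks lane1.fastq 4000000
```

Comparing the result with the threads setting gives a feel for whether all chunks of a lane can be aligned at once or will be processed in several rounds.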

Error Recovery

When an error occurs, such as loss of power or of access to network storage, it is unnecessary to rerun everything from scratch. Instead, IMR can be restarted. For example:

imr easyrun -q usepre --iterstartfrom 2  example.t

will rerun IMR from the second iteration.

imr easyrun -q usepre --iterstartfrom 2  --imrstartfromcall example.t

will rerun IMR from the second iteration, starting from variant calling.
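
One way to decide which value to pass to --iterstartfrom is to look at which updated reference files already exist in the project folder (the newref_A.fa, newref_B.fa, ... files described under Output). The sketch below assumes that newref_A.fa is present and non-empty only once iteration 1 has finished, newref_B.fa once iteration 2 has finished, and so on. That is an assumption, and a file truncated by a crash would fool it, so the result should be checked by hand. The function name next_iteration is hypothetical.

```shell
# Print the iteration to restart from, based on which newref_*.fa files
# exist in the project folder. Counting stops at the first missing letter.
# Assumption: a non-empty newref_X.fa means that iteration completed.
next_iteration() {
    project_dir=$1
    n=1
    for letter in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
        [ -s "$project_dir/newref_$letter.fa" ] || break
        n=$((n + 1))
    done
    echo "$n"
}

# Example call (not executed here; the path is a placeholder):
#   imr easyrun -q usepre --iterstartfrom "$(next_iteration /path/to/project)" example.t
```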

Output

IMR produces three types of output files:

a series of updated reference fasta files representing the state of the reference sequence at each iteration. For simplicity, we hereafter assume that the project folder, set by outputfolder in the project description file, is $sequencing_project. The new reference files can be found under the folder $sequencing_project/. They are named newref_*.fa, with * running from A to Z:

newref_A.fa      (The changed reference after the first iteration)
newref_B.fa       (The changed reference after the second iteration)
newref_C.fa       (The changed reference after the third iteration)
....

a series of bam files representing the reads aligned to the genome sequences in each iteration.
```
$sequencing_project/A/($project_basename).bam (reads aligned to the original reference)
$sequencing_project/B/($project_basename)_B.bam (reads aligned to the newref_A.fa)
$sequencing_project/C/($project_basename)_C.bam (reads aligned to the newref_B.fa)
...
```
If --outbam is set, the specified file will be the same as $sequencing_project/A/($project_basename)_A.bam, which is often used by MCMERGE or other variant calling algorithms.

the sequence differences (SNPs and INDELs) between the original reference and the genome investigated (the final iterated reference). All variants are available in a single sdi file, $sequencing_project/pro/result_imr.sdi. If --outputfile is set, the specified file will also be created.

DENOM

DENOM aligns contigs obtained from de novo assembly to a reference genome and calls variants (i.e. differences between the contigs and the reference). In principle it can handle short-read data too, but this has not been extensively tested. It is designed to reassemble homozygous genomes, e.g. inbred strains or haploid organisms, where a reference genome is available that is sufficiently similar to the genome of the assembled sample.

DENOM is not designed to replace de novo assembly algorithms. On the contrary, it is designed to enhance them. Current de novo assemblers usually produce a large number of contigs (which may be scaffolded together to a limited extent), rather than complete chromosome sequences. DENOM is designed to produce such complete sequences.

DENOM is also complementary to IMR, in the sense that it can be used to integrate de novo contigs with the output of IMR. In assembling Arabidopsis thaliana, we have found that their combination improves on both IMR and DENOM applied to the original reference genomes, especially in repetitive regions. Therefore we strongly suggest running both DENOM and IMR and then merging their results using MCMERGE. DENOM itself, however, can be run independently.

INSTALLING DENOM

BWA and SAMTOOLS must be installed on your system. Make sure they are on your executable path or in the external/ subfolder of your IMR/DENOM installation directory.

Running DENOM

Option 1, Running DENOM through the interface to SOAPdenovo

Please install SOAPdenovo v12.04+ first.

 denom soapinteface <descriptionfile>

The description file is exactly the same one used for IMR.

Output: DENOM creates the following files:

$sequencing_project/soapassembly/soap4denom.contig { SOAPdenovo output}
$sequencing_project/soapassembly/soap4denom.bam
$sequencing_project/soapassembly/soap4denom.sdi

The output file soap4denom.sdi, in sdi format, contains all variants called by DENOM. Together with the BAM file soap4denom.bam, it will be used by MCMERGE.

Warning: since SOAPdenovo usually takes a huge amount of memory (20G of memory is needed for Arabidopsis with ~30x coverage), we strongly suggest contacting your system administrator before running this. At WTCHG, a special server is used to run this job.

Option 2, Running DENOM when an assembled contig file is available

Before running, it is necessary to assemble contigs using a de novo assembler. DENOM can directly use the results from SOAPdenovo, ABySS or Velvet, with SOAPdenovo strongly suggested. When a FASTA-format file of contigs is available, you can run DENOM using the command below.

     denom easyrun <ref.fa>  <contig.fa> <out.bam> <out.sdi>

     denom fasiege         prepare the fasta file for mapping 
     denom premap          preliminary mapping using bwa 
     denom varcall         call the variants

The outputs are <out.sdi>, in sdi format, and <out.bam>.

MCMERGE

MCMERGE merges the variants called from different algorithms/solutions. Currently, it is tuned to merge the outputs from IMR and DENOM, but the algorithm can be easily extended.

Running MCMERGE

    mcmerge easyrun [options]  <ref>  <imr.sdi>  <denom.sdi>  <imrbam>  <denombam>  <simbam>

options:

   --help                    produce help message
   -o [ --outfile ] arg      output file
   -p [ --process ] arg (=4) number of cores used
   -t [ --tmpdir ] arg (=./) directory for temporary files

The input files imr.sdi and imrbam are produced by IMR, and denom.sdi and denombam are produced by DENOM. In addition, MCMERGE uses a simulation input file, simbam, to identify regions of the reference likely to produce unreliable results because of the mapping algorithm, repetitive regions or errors in the reference genome sequence. Ideally, simbam should be computed taking into account the number of reads and the read-length distributions in the original fastq files (including multiple libraries), using IMR to align all the simulated reads to the reference with the same parameters. In practice, we have found that this needs to be computed only once, as unreliable loci are quite stable. Therefore we recommend that simbam be downloaded directly from simulation bamfile. At the moment, simbam files for Arabidopsis, mouse and rat are available.

You can also generate your own simbam file by running the script sim_imrdenom.pl with the reference genome as parameter. It will generate two simulated read files and a project description file. Then run imr easyrun --imrnocall [-m bwa] pro_descript to obtain a bam file. That is the simbam you need.
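
The two steps for generating your own simbam can be wrapped in a small script. The following is a minimal sketch, not a tested recipe: it assumes sim_imrdenom.pl is executable and on the path, and because the name of the project description file it writes is not stated here, that name is passed in as an argument. The helper names run and make_simbam are hypothetical. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Print each command; execute it only when DRY_RUN is set to 0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# $1: reference genome (fasta)
# $2: project description file generated by sim_imrdenom.pl
#     (its real name is an assumption of the caller)
make_simbam() {
    ref=$1
    descr=$2
    run sim_imrdenom.pl "$ref" &&
    run imr easyrun --imrnocall -m bwa "$descr"
}

# Example call (file names are placeholders):
#   DRY_RUN=0 make_simbam ref.fa pro_descript
```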

Get the assembled genome

   mcmerge getgenome [options] <ref> <last.sdi>

options:

   --help                    produce help message
   -o [ --outfile ] arg      set the output file
   -p [ --process ] arg (=4) set the number of processors used
   -t [ --tmpdir ] arg (=./) set the directory for temporary files

MCMERGE uses multi-threading, with the number of threads/cores controlled by the --process option.

Output

For the command mcmerge easyrun, the output is an sdi file (sdi format). You can set the filename either with --outfile or by redirecting console output (i.e. >filename).

For the command mcmerge getgenome, the output is a fasta file containing sequences for all chromosomes. You can set the filename either with --outfile or by redirecting console output (i.e. >filename).
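
Putting the pieces together, a complete run chains IMR, DENOM and MCMERGE. The sketch below only strings together the commands and options documented above. The output file names (imr.sdi, imr.bam and so on) and the helper names run and imrdenom_pipeline are illustrative assumptions, and the contig file is assumed to have been assembled beforehand. By default the commands are printed rather than executed; set DRY_RUN=0 to run them.

```shell
# Print each command; execute it only when DRY_RUN is set to 0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# $1: reference fasta   $2: project description file   $3: contig fasta
# $4: simbam file       $5: directory for results (must already exist)
imrdenom_pipeline() {
    ref=$1; project=$2; contigs=$3; simbam=$4; out=$5
    # IMR: iterative mapping, writing the final sdi and bam files.
    run imr easyrun -o "$out/imr.sdi" -e "$out/imr.bam" "$project" &&
    # DENOM: align de novo contigs and call variants.
    run denom easyrun "$ref" "$contigs" "$out/denom.bam" "$out/denom.sdi" &&
    # MCMERGE: merge both variant sets, using simbam to flag unreliable regions.
    run mcmerge easyrun -o "$out/merged.sdi" "$ref" \
        "$out/imr.sdi" "$out/denom.sdi" "$out/imr.bam" "$out/denom.bam" "$simbam" &&
    # Build the assembled genome from the merged variants.
    run mcmerge getgenome -o "$out/genome.fa" "$ref" "$out/merged.sdi"
}

# Example call (file names are placeholders):
#   imrdenom_pipeline ref.fa example.t contig.fa sim.bam results
```

Chaining the steps with && stops the run at the first failing command, so a later step never consumes incomplete output from an earlier one.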
