IMR requires additional packages to be installed in order to perform read-mapping. Please check the IMR/DENOM installation for the default mode.
The following short-read mappers are supported:
(Support for other mappers such as Bowtie, soap2 and zoom is planned.) Make sure the read mapper is installed and available system-wide, or copy the executable file into the external/ subfolder.
IMR uses the project description text file mentioned before. When it is available, the default way to run IMR is:
imr easyrun example.t
To use bwa as the mapper rather than the default stampy:
imr easyrun -m bwa example.t
To use IMR to align all reads to the reference without iteration, creating a single bam file for visualization or other analysis:
imr easyrun --imrnocall example.t
To use IMR to call variants from an existing bam file, without iterations:
imr imrcall [options] {ref} {bamfile} [region...]
Other options of imr easyrun:
--help                     produce help message
-o [ --outputfile ] arg    set the output sdi file
-e [ --outbam ] arg        output the new bam file
-f [ --format ] arg        file format used for preprocessing
--imrnocall                Only map reads and merge bam files, no variant call
--imrkeepdup               For the merged bam files, do not remove duplicates
--imrstartfrommap          Start to map raw reads. It can reuse the previous finished part.
--imrstartfromcall         Start from variant calling, no mapping or merging
--mergeall                 Merge all reads-group together then deal with (remove/keep) duplicate
-m [ --mapper ] arg        the name of program used for mapping: bwa/maq/stampy/smalt [=stampy]
--iterations arg           The number of rounds of iterations [=5]
--iterstartfrom arg        Start Iteration from which round [=1]
-p [ --threads ] arg       Maximum processors used, can be set in configure file too [=4]
-q [ --qual ] arg          fastq File format used: sanger,solexa,solexaold,usepr
The alignment and analysis of next generation sequencing data are time-consuming. Even a common multi-core or multi-processor PC can benefit from IMR's parallel computation support by aligning several lanes simultaneously. Multi-threading is used as follows:
maxreads XX in the project description file), and then align them simultaneously. This optimization can also be cancelled by setting
imr easyrun -q nopre
on the command line, or in the project description file using the parameter
prepara -q nopre
By setting the threads variable in the project description file. For example,
threads 4
will align at most 4 lanes at one time.
When an error occurs, such as loss of power or loss of access to network storage, it is unnecessary to rerun everything from scratch. Instead, IMR can be restarted. For example:
imr easyrun -q usepre --iterstartfrom 2 example.t
will rerun IMR from the second iteration.
imr easyrun -q usepre --iterstartfrom 2 --imrstartfromcall example.t
will rerun IMR from the second iteration, starting from variant-calling.
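The restart commands above can also be assembled programmatically. The following is a minimal sketch, not part of IMR: the helper name build_imr_easyrun and its parameter names are hypothetical, and it only uses flags that appear in the option summary (-m, -q usepre, --iterstartfrom, --imrstartfromcall). It assumes iterations are numbered from 1, as the default of --iterstartfrom suggests, and it only builds the argument list without running anything.

```python
from typing import List, Optional

# Mapper names as listed for -m in the option summary.
SUPPORTED_MAPPERS = ("bwa", "maq", "stampy", "smalt")


def build_imr_easyrun(description_file: str,
                      mapper: Optional[str] = None,
                      iterstartfrom: Optional[int] = None,
                      start_from_call: bool = False,
                      reuse_preprocessing: bool = False) -> List[str]:
    """Return an argument list for `imr easyrun` (hypothetical helper)."""
    if mapper is not None and mapper not in SUPPORTED_MAPPERS:
        raise ValueError(f"unsupported mapper: {mapper}")
    if iterstartfrom is not None and iterstartfrom < 1:
        # Assumption: rounds are numbered from 1 (default of --iterstartfrom).
        raise ValueError("iterations are numbered from 1")
    cmd = ["imr", "easyrun"]
    if mapper is not None:
        cmd += ["-m", mapper]
    if reuse_preprocessing:
        cmd += ["-q", "usepre"]
    if iterstartfrom is not None:
        cmd += ["--iterstartfrom", str(iterstartfrom)]
    if start_from_call:
        cmd.append("--imrstartfromcall")
    cmd.append(description_file)
    return cmd


# Rebuild the second restart example: resume at iteration 2 from variant calling.
restart = build_imr_easyrun("example.t", iterstartfrom=2,
                            start_from_call=True, reuse_preprocessing=True)
print(" ".join(restart))
```

The returned list can be passed to subprocess.run(restart, check=True) once imr is on the executable path.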
IMR produces three types of output files:
outputfolder
in the project description is $sequencing_project. The new reference files can be found under folder /sequencing_project/. Their names are as follows (in the form newref_*.fa, * starting from A and ending at Z):
newref_A.fa (the changed reference after the first iteration)
newref_B.fa (the changed reference after the second iteration)
newref_C.fa (the changed reference after the third iteration)
....
$sequencing_project/A/($project_basename).bam (reads aligned to the original reference)
$sequencing_project/B/($project_basename)_B.bam (reads aligned to newref_A.fa)
$sequencing_project/C/($project_basename)_C.bam (reads aligned to newref_B.fa)
...
If --outbam is set, the specified file will be the same as $sequencing_project/A/($project_basename)_A.bam, which is often used by MCMERGE or other variant-calling algorithms.
If --outputfile is set, the specified file will also be created.
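The letter-based naming of the per-iteration outputs can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes the new reference files sit directly under the project output folder and that iteration 1 maps to A, iteration 2 to B, and so on up to Z, as the listings above suggest.

```python
import string
from pathlib import PurePosixPath


def iteration_letter(iteration: int) -> str:
    """Map iteration 1..26 to the letters 'A'..'Z'."""
    if not 1 <= iteration <= 26:
        raise ValueError("iteration must be between 1 and 26")
    return string.ascii_uppercase[iteration - 1]


def newref_path(project_dir: str, iteration: int) -> str:
    """Path of the changed reference written after the given iteration."""
    name = f"newref_{iteration_letter(iteration)}.fa"
    return str(PurePosixPath(project_dir) / name)


def bam_dir(project_dir: str, iteration: int) -> str:
    """Folder holding the bam file produced in the given iteration."""
    return str(PurePosixPath(project_dir) / iteration_letter(iteration))


# List the expected locations for the first three iterations.
for i in range(1, 4):
    print(i, newref_path("sequencing_project", i), bam_dir("sequencing_project", i))
```

Such a helper is mainly useful for checking which iterations have completed before restarting with --iterstartfrom.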
DENOM aligns contigs obtained from de novo assembly to a reference genome and calls variants (i.e. differences between the contigs and the reference). In principle it can handle short-read data too, but this has not yet been tested extensively. It is designed to reassemble homozygous genomes, e.g. inbred strains or haploid organisms, where a reference genome is available that is sufficiently similar to the genome of the assembled sample.
DENOM is not designed to replace de novo assembly algorithms. On the contrary, it is designed to enhance them. Current de novo assemblers usually produce a large number of contigs (which may be scaffolded together to a limited extent), rather than complete chromosome sequences. DENOM is designed to produce the latter.
DENOM is also complementary to IMR, in the sense that it can be used to integrate de novo contigs with the output of IMR. In assembling Arabidopsis thaliana, we have found that their combination improves both IMR and DENOM applied to the original reference genomes, especially in repetitive regions. Therefore we strongly suggest running both DENOM and IMR and then merging their results using MCMERGE. DENOM can, however, be run independently.
BWA and SAMTOOLS must be installed on your system. Make sure they are on your executable path or in the external/ subfolder of your IMR/DENOM installation directory.
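A quick way to verify this prerequisite is sketched below. It is not part of DENOM: the helper names and the install_dir parameter are hypothetical, and it assumes the executables are named bwa and samtools. It looks on the executable path first and then in the external/ subfolder.

```python
import os
import shutil
from typing import Dict, Optional, Sequence


def locate_tool(name: str, install_dir: str) -> Optional[str]:
    """Return the path to `name` from PATH or <install_dir>/external/, else None."""
    found = shutil.which(name)
    if found:
        return found
    candidate = os.path.join(install_dir, "external", name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


def check_prerequisites(install_dir: str,
                        tools: Sequence[str] = ("bwa", "samtools")) -> Dict[str, Optional[str]]:
    """Map each required tool to its location, or None when it is missing."""
    return {tool: locate_tool(tool, install_dir) for tool in tools}


if __name__ == "__main__":
    # "." stands in for the IMR/DENOM installation directory (an assumption).
    for tool, path in check_prerequisites(".").items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```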
Please install SOAPdenovo v12.04+ first.
denom soapinteface <descriptionfile>
The description file is exactly the same one used for IMR.
Output: DENOM creates the following files:
$sequencing_project/soapassembly/soap4denom.contig { SOAPdenovo output}
$sequencing_project/soapassembly/soap4denom.bam
$sequencing_project/soapassembly/soap4denom.sdi
The output file soap4denom.sdi (in sdi format, containing all variants called by DENOM) and the BAM file soap4denom.bam will be used by MCMERGE.
Warning
SOAPdenovo usually requires a large amount of memory (20G is needed for Arabidopsis with ~30x coverage), so we strongly suggest contacting your system administrator before running it. At WTCHG, a dedicated server is used to run this job.
Before running DENOM, it is necessary to assemble contigs using a de novo assembler. DENOM can directly use the result from soapDenovo, ABYSS or velvet, with soapDenovo strongly suggested. When a FASTA format file of contigs is available, you can run DENOM using the command below.
denom easyrun <ref.fa> <contig.fa> <out.bam> <out.sdi>
The steps behind easyrun:
denom fasiege    prepare the fasta file for mapping
denom premap     preliminary mapping using bwa
denom varcall    call the variants
Output: DENOM creates an output file <out.sdi>, using the sdi format, and a BAM file <out.bam>.
MCMERGE merges the variants called from different algorithms/solutions. Currently, it is tuned to merge the outputs from IMR and DENOM, but the algorithm can be easily extended.
mcmerge easyrun [options] <ref> <imr.sdi> <denom.sdi> <imrbam> <denombam> <simbam>
options:
--help                    produce help message
-o [ --outfile ] arg      output file
-p [ --process ] arg (=4) number of cores used
-t [ --tmpdir ] arg (=./) directory for temporary files
The input files imr.sdi and imrbam are produced by IMR, and denom.sdi and denombam are produced by DENOM. In addition, MCMERGE uses a simulation input file, simbam, to identify regions of the reference likely to produce unreliable results, caused by the mapping algorithm, repetitive regions or errors in the reference genome sequence.
Ideally, simbam should be computed taking into account the number of reads and the read-length distributions in the original fastq files (including multiple libraries), using IMR to align all the simulated reads to the reference with the same parameters. In practice, we have found that it needs to be computed only once, as unreliable loci are quite stable. Therefore we recommend that simbam be downloaded directly from simulation bamfile. At the moment, simbam files for Arabidopsis, mouse and rat are available.
You can also generate your own simbam file by running the script sim_imrdenom.pl with the reference genome as its parameter. It will generate two simulated read files and a project description file. Then run:
imr easyrun --imrnocall [-m bwa] pro_descript
This produces a bam file, which is the simbam you need.
mcmerge getgenome [options] <ref> <last.sdi>
options:
--help                    produce help message
-o [ --outfile ] arg      set the output file
-p [ --process ] arg (=4) set the number of processors used
-t [ --tmpdir ] arg (=./) set the directory for temporary files
MCMERGE uses multi-threading, with the number of threads/cores controlled by setting the --process option.
For the command mcmerge easyrun, the output is a file in sdi format. You can set the filename either with --outfile or by redirecting the console output (i.e. >filename).
For the command mcmerge getgenome, the output is a fasta file containing sequences for all chromosomes. You can set the filename either with --outfile or by redirecting the console output (i.e. >filename).
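The two MCMERGE commands can be chained as in the following sketch. It is an illustration under stated assumptions, not the tool's own interface: the helper names and all file names are hypothetical, options are placed before the positional arguments as in the usage lines above, and the sdi file passed to getgenome is assumed to be the merged output of mcmerge easyrun. The sketch only builds the argument lists.

```python
from typing import List, Optional


def _options(outfile: Optional[str], process: Optional[int],
             tmpdir: Optional[str]) -> List[str]:
    """Translate the optional settings into the documented long options."""
    opts: List[str] = []
    if outfile is not None:
        opts += ["--outfile", outfile]
    if process is not None:
        opts += ["--process", str(process)]
    if tmpdir is not None:
        opts += ["--tmpdir", tmpdir]
    return opts


def build_mcmerge_easyrun(ref: str, imr_sdi: str, denom_sdi: str,
                          imr_bam: str, denom_bam: str, sim_bam: str,
                          outfile: Optional[str] = None,
                          process: Optional[int] = None,
                          tmpdir: Optional[str] = None) -> List[str]:
    """Argument list for `mcmerge easyrun` (hypothetical helper)."""
    return (["mcmerge", "easyrun"] + _options(outfile, process, tmpdir)
            + [ref, imr_sdi, denom_sdi, imr_bam, denom_bam, sim_bam])


def build_mcmerge_getgenome(ref: str, last_sdi: str,
                            outfile: Optional[str] = None,
                            process: Optional[int] = None,
                            tmpdir: Optional[str] = None) -> List[str]:
    """Argument list for `mcmerge getgenome` (hypothetical helper)."""
    return (["mcmerge", "getgenome"] + _options(outfile, process, tmpdir)
            + [ref, last_sdi])


# All file names below are placeholders.
merge = build_mcmerge_easyrun("ref.fa", "imr.sdi", "denom.sdi",
                              "imr.bam", "denom.bam", "sim.bam",
                              outfile="merged.sdi", process=4)
genome = build_mcmerge_getgenome("ref.fa", "merged.sdi", outfile="genome.fa")
print(" ".join(merge))
print(" ".join(genome))
```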