Cardamine hirsuta-IMR/DENOM short read genome assembler

Prepare Project Configuration File

The easiest way to run IMR/DENOM is with a project description text file, which defines all the necessary information. A typical project might involve assembling more than one library, each with a different insert size. The description file tells IMR/DENOM how to interpret the input files and how to group them. A first, simple example with a single library looks like this:

#---example.t---
outputfolder /home/xianggan/bur_0/
reference /data/genomes/arabidopsis_v2.fa
iterations 5
threads 6

loaddata /data/save/s_5_1_sequence.txt /data/save/s_5_2_sequence.txt
loaddata /data/save/s_6_1_sequence.txt /data/save/s_6_2_sequence.txt

#only used by MCMERGE, can be downloaded from the web site
profilebam /data/sim/sim_tair10.bam
## ---end--

outputfolder names the folder to which all output files are written. The folder is created if it does not exist; an existing folder is re-used.
reference defines the reference genome file in FASTA format.
iterations defines the number of iterations. [The default value is 5]
threads defines the maximum number of threads/cores used. [see later]
loaddata loads FASTQ files produced by the Illumina sequencer. IMR supports single-end reads, paired-end reads, or any mixture of the two. For paired-end data, use one loaddata command for each pair of files.
profilebam is an option used by MCMERGE to set the simulated BAM file (the simbam).
simbam can help to identify regions of the reference that are likely to produce unreliable results because of the mapping algorithm, repetitive regions, or errors in the reference genome sequence. Ideally, simbam should be computed taking into account the number of reads and the read-length distributions in the original FASTQ files (including multiple libraries), using IMR to align all the simulated reads to the reference with the same parameters. In practice, we have found that it needs to be computed only once, because unreliable loci are quite stable. We therefore recommend that simbam be downloaded directly from simulation bamfile. At the moment, simbam files for Arabidopsis, mouse and rat are available.
You can also generate your own simbam file by running the script sim_imrdenom.pl with the reference genome as its parameter. It generates two simulated read files and a project description file. Then run imr easyrun --imrnocall [-m bwa] pro_descript. You will get a BAM file; that is the simbam you need.
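To make the layout of the description file concrete, the following is a minimal Python sketch of how such a file could be read. It is not IMR/DENOM's own parser: the function name parse_project is hypothetical, and the sketch assumes that entries are whitespace-separated, that lines starting with # are comments, and that a keyword such as loaddata may be repeated.

```python
from collections import defaultdict

def parse_project(text):
    """Parse a project description into {keyword: [argument lists]}.

    Illustrative only: assumes whitespace-separated entries and
    '#' comment lines, as in the example above.
    """
    entries = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, *args = line.split()
        entries[keyword].append(args)
    return dict(entries)

example = """\
#---example.t---
outputfolder /home/xianggan/bur_0/
reference /data/genomes/arabidopsis_v2.fa
iterations 5
threads 6

loaddata /data/save/s_5_1_sequence.txt /data/save/s_5_2_sequence.txt
loaddata /data/save/s_6_1_sequence.txt /data/save/s_6_2_sequence.txt

profilebam /data/sim/sim_tair10.bam
## ---end--
"""

project = parse_project(example)

# One loaddata line per pair of files: two arguments means paired-end,
# one argument means single-end.
for files in project["loaddata"]:
    kind = "paired-end" if len(files) == 2 else "single-end"
    print(kind, files)
```

Each loaddata line stays a separate list, so a mixture of single-end and paired-end inputs remains distinguishable after parsing.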

A second more complicated example involving two libraries looks like this:

#---example2.t---
outputfolder /data/sequencing_project/
reference /data/Arabidopsis/tair10_um.fa
iterations 5
threads 6
maxreads 4000000

#Group 1
grouppara_1 ID:bur0LA,LB:bur0_LIB1,PL:Illumina,SM:bur0
prepara_1 -q solexa 
mappara_1 --substitutionrate=0.001
loaddata_1 /data/s_7_1_sequence.txt /data/s_7_2_sequence.txt

#Group 2
grouppara_2 ID:bur0LB,LB:bur0_LIB2,PL:Illumina,SM:bur0
prepara_2 -q sanger 
loaddata_2 /data/save/Hi_SR03/solexafiles/s_4_sequence.txt
loaddata_2 /data/save/Hi_SR03/solexafiles/s_5_sequence.txt

#only used by MCMERGE, can be downloaded from the web site
profilebam /data/sim/sim_tair10.bam
## ---end--

This example uses additional entries to group the data:

grouppara_* sets the read group parameters for the output BAM files. For example, `ID:bur0_S2,LB:bur0_LIB2,PL:Illumina,SM:bur0` means that the read group name is `bur0_S2`, all of its reads come from library `bur0_LIB2`, the Illumina platform was used to produce the reads, and the sample name is `bur0`. [The default value is null.]
mappara_* is an additional option passed on to the mapper [for example, `--substitutionrate=0.001` for stampy]. [The default value is null.]
prepara_* is an additional option that sets which base quality encoding the reads use and what kind of preprocessing is necessary. The possible options are `-q sanger`, `-q solexa` and `-q solexaold`. The default value is `-q sanger`.
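The grouping can be illustrated with a short Python sketch that collects the suffixed entries per group and formats each grouppara value as an @RG header line in the tab-separated form defined by the SAM specification. This is only an illustration under stated assumptions: collect_groups and read_group_header are hypothetical names, group suffixes are assumed to be integers, and how IMR/DENOM itself passes these values on is not described here. The input is an abridged copy of the second example.

```python
import re
from collections import defaultdict

# Assumption: group suffixes are integers, as in grouppara_1 or loaddata_2.
GROUP_KEY = re.compile(r"^(grouppara|prepara|mappara|loaddata)_(\d+)$")

def collect_groups(text):
    """Collect suffixed entries of a project description, keyed by group number."""
    groups = defaultdict(lambda: {"loaddata": []})
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        match = GROUP_KEY.match(parts[0])
        if match is None:
            continue  # global keyword such as reference or threads
        name, index = match.group(1), int(match.group(2))
        value = parts[1] if len(parts) > 1 else ""
        if name == "loaddata":
            groups[index]["loaddata"].append(value.split())
        else:
            groups[index][name] = value
    return dict(groups)

def read_group_header(grouppara):
    """Turn 'ID:x,LB:y,PL:z,SM:s' into a tab-separated SAM @RG header line."""
    return "@RG\t" + "\t".join(grouppara.split(","))

example2 = """\
threads 6
#Group 1
grouppara_1 ID:bur0LA,LB:bur0_LIB1,PL:Illumina,SM:bur0
prepara_1 -q solexa
mappara_1 --substitutionrate=0.001
loaddata_1 /data/s_7_1_sequence.txt /data/s_7_2_sequence.txt
#Group 2
grouppara_2 ID:bur0LB,LB:bur0_LIB2,PL:Illumina,SM:bur0
prepara_2 -q sanger
loaddata_2 /data/save/Hi_SR03/solexafiles/s_4_sequence.txt
"""

groups = collect_groups(example2)
for index in sorted(groups):
    group = groups[index]
    print(index, read_group_header(group["grouppara"]).replace("\t", " "))
    print("  prepara:", group.get("prepara", "-q sanger"))  # documented default
    print("  mappara:", group.get("mappara", ""))           # documented default is null
    print("  files:  ", group["loaddata"])
```

Group 2 has no mappara_2 entry, so the sketch falls back to an empty value, mirroring the documented null default.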

Other useful keywords or tips:

PICARD_PATH sets the directory containing the Picard binary files.
jvmargs sets the Java JVM parameters for Picard. IMR/DENOM uses Picard to mark duplicate reads. By default, -Xmx2g is used. However, for large genomes or high-coverage projects, you may need more memory. For example, jvmargs -Xmx6g tells the system to use 6 GB of memory for the JVM.
Environment variables are supported in the project description file. This can be very useful if your project folder is on a central storage system but is mounted at different points on two servers. For example, on server 1, your project folder is
```
/biodata/me/worm/
```
and on server 2, your project folder is
```
/externaldata/staff/worm
```
If you export DPATH=/biodata/me on server 1 and DPATH=/externaldata/staff on server 2, you can use
```
outputfolder $DPATH/worm
```
in your project description file. This allows you to use the same project description file on both servers (one runs IMR and the other runs DENOM). Environment variables are supported in all path-related entries, e.g., reference, loaddata and profilebam.
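The effect of the DPATH variable can be sketched in Python with the standard library function os.path.expandvars. This only illustrates the idea; the helper name expand_entry is hypothetical, and the exact expansion rules that IMR/DENOM applies (for example, whether it accepts the ${DPATH} form) are not stated here.

```python
import os

def expand_entry(line):
    """Split a description line into keyword and path, expanding $VARIABLES."""
    keyword, path = line.split(None, 1)
    return keyword, os.path.expandvars(path.strip())

entry = "outputfolder $DPATH/worm"

resolved = {}
for server, mount in [("server 1", "/biodata/me"),
                      ("server 2", "/externaldata/staff")]:
    # Stands in for running `export DPATH=...` on that server.
    os.environ["DPATH"] = mount
    keyword, path = expand_entry(entry)
    resolved[server] = path
    print(server, keyword, path)
```

The same line resolves to /biodata/me/worm on server 1 and to /externaldata/staff/worm on server 2, so one description file serves both machines.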
