The easiest way to run IMR/DENOM is with a project description text file, which defines all the necessary information. A typical project might involve the assembly of more than one library with different insert sizes. The description file tells IMR/DENOM how to interpret the input files and how to group them together. A first simple example with a single library looks like this:
#---example.t---
outputfolder /home/xianggan/bur_0/
reference /data/genomes/arabidopsis_v2.fa
iterations 5
threads 6
loaddata /data/save/s_5_1_sequence.txt /data/save/s_5_2_sequence.txt
loaddata /data/save/s_6_1_sequence.txt /data/save/s_6_2_sequence.txt
#only used by MCMERGE, can be downloaded from the web site
profilebam /data/sim/sim_tair10.bam
## ---end--
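To make the layout of the description file concrete, here is a minimal Python sketch that reads such a file into a dictionary. It is not part of IMR/DENOM: the function name parse_description is hypothetical, and the rules it applies (one "keyword value ..." entry per line, lines starting with # are comments, a keyword such as loaddata may repeat) are assumptions inferred from the example above.

```python
from collections import defaultdict


def parse_description(text):
    """Collect 'keyword value ...' lines; lines starting with '#' are skipped.

    Assumed format, inferred from the example; not IMR/DENOM's own parser.
    """
    entries = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        keyword, *values = line.split()
        # A keyword may occur several times (e.g. loaddata), so keep a list.
        entries[keyword].append(values)
    return dict(entries)


EXAMPLE = """\
#---example.t---
outputfolder /home/xianggan/bur_0/
reference /data/genomes/arabidopsis_v2.fa
iterations 5
threads 6
loaddata /data/save/s_5_1_sequence.txt /data/save/s_5_2_sequence.txt
loaddata /data/save/s_6_1_sequence.txt /data/save/s_6_2_sequence.txt
profilebam /data/sim/sim_tair10.bam
## ---end--
"""

project = parse_description(EXAMPLE)
for pair in project["loaddata"]:
    print("read files:", pair)
```

Each loaddata line in this example carries two read files, so the sketch keeps them together as one list per line.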
simbam can help to identify regions of the reference that are likely to produce unreliable results because of the mapping algorithm, repetitive regions or errors in the reference genome sequence. Ideally, simbam should be computed taking into account the number of reads and the read-length distributions in the original fastq files (including multiple libraries), using IMR to align all the simulated reads to the reference with the same parameters. In practice, we have found that it needs to be computed only once, as unreliable loci are quite stable. Therefore we recommend that simbam be downloaded directly from the simulation bamfile link. At the moment, simbam files for Arabidopsis, mouse and rat are available.
You can also generate your own simbam file by running the script sim_imrdenom.pl with the reference genome as its parameter. It will generate two simulated read files and a project description file. Then run:
imr easyrun --imrnocall [-m bwa] pro_descript
This produces a BAM file, which is the simbam you need.
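If the alignment step is launched from a script, the command above can be assembled as an argument list. The following Python sketch only mirrors the command line shown in the text; the helper name easyrun_command is hypothetical, and the optional -m bwa part is included only when a mapper is given.

```python
import shlex


def easyrun_command(description_file, mapper=None):
    """Build 'imr easyrun --imrnocall [-m MAPPER] DESCRIPTION' as a list."""
    cmd = ["imr", "easyrun", "--imrnocall"]
    if mapper is not None:
        # Optional part of the documented command, e.g. "-m bwa".
        cmd += ["-m", mapper]
    cmd.append(description_file)
    return cmd


default_cmd = easyrun_command("pro_descript")
bwa_cmd = easyrun_command("pro_descript", mapper="bwa")
print(shlex.join(default_cmd))
print(shlex.join(bwa_cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) on a machine where imr is installed.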
A second more complicated example involving two libraries looks like this:
#---example2.t---
outputfolder /data/sequencing_project/
reference /data/Arabidopsis/tair10_um.fa
iterations 5
threads 6
maxreads 4000000
#Group 1
grouppara_1 ID:bur0LA,LB:bur0_LIB1,PL:Illumina,SM:bur0
prepara_1 -q solexa
mappara_1 --substitutionrate=0.001
loaddata_1 /data/s_7_1_sequence.txt /data/s_7_2_sequence.txt
#Group 2
grouppara_2 ID:bur0LB,LB:bur0_LIB2,PL:Illumina,SM:bur0
prepara_2 -q sanger
loaddata_2 /data/save/Hi_SR03/solexafiles/s_4_sequence.txt
loaddata_2 /data/save/Hi_SR03solexafiles/s_5_sequence.txt
#only used by MCMERGE, can be downloaded from the web site
profilebam /data/sim/sim_tair10.bam
## ---end--
This example uses additional entries to group the data. The numeric suffix (_1, _2) ties each entry to its group:
grouppara: the read group description. For example, ID:bur0_S2,LB:bur0_LIB2,PL:Illumina,SM:bur0 means the read group name is bur0_S2, with all reads from library bur0_LIB2, and the Illumina platform was used to produce the reads. [The default value is null.]
mappara: extra parameters for the mapper [e.g. --substitutionrate=0.001 for stampy]. [The default value is null.]
prepara: one of -q sanger, -q solexa, -q solexaold. The default value is -q sanger.
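The grouping in example2.t can be illustrated with a short Python sketch that collects the suffixed entries per group and splits a grouppara value into its tags. This is only an illustration under stated assumptions: the names group_entries and read_group_tags are hypothetical, and the rule that a trailing _N selects group N is inferred from the example, not taken from the IMR/DENOM source.

```python
import re
from collections import defaultdict

# Keywords that carry a group suffix in example2.t (assumed list).
GROUP_KEY = re.compile(r"^(grouppara|prepara|mappara|loaddata)_(\d+)$")


def group_entries(lines):
    """Map group number -> {keyword: [values]} for suffixed entries."""
    groups = defaultdict(lambda: defaultdict(list))
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        match = GROUP_KEY.match(keyword)
        if match:
            groups[int(match.group(2))][match.group(1)].append(value)
    return {number: dict(entries) for number, entries in groups.items()}


def read_group_tags(grouppara):
    """Split 'ID:x,LB:y,PL:z,SM:s' into a dict of tag -> value."""
    return dict(item.split(":", 1) for item in grouppara.split(","))


lines = [
    "#Group 1",
    "grouppara_1 ID:bur0LA,LB:bur0_LIB1,PL:Illumina,SM:bur0",
    "prepara_1 -q solexa",
    "mappara_1 --substitutionrate=0.001",
    "loaddata_1 /data/s_7_1_sequence.txt /data/s_7_2_sequence.txt",
    "#Group 2",
    "grouppara_2 ID:bur0LB,LB:bur0_LIB2,PL:Illumina,SM:bur0",
    "prepara_2 -q sanger",
]

groups = group_entries(lines)
tags = read_group_tags(groups[2]["grouppara"][0])
print(sorted(groups), tags["LB"])
```

Group 2 has no mappara entry here, so that key is simply absent from its dictionary, which corresponds to the null default described above.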
Other useful keywords or tips:
If your project folder is /biodata/me/worm/ on server 1 and /externaldata/staff/worm on server 2, you can export DPATH=/biodata/me on server 1 and DPATH=/externaldata/staff on server 2, and then write
outputfolder $DPATH/worm
in your project description file. This allows you to use the same project description file on both servers (one runs IMR and the other runs DENOM). Environment variables are supported in all path-related entries, e.g., reference, loaddata and profilebam.
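The effect of the DPATH variable can be sketched in Python. How IMR/DENOM itself expands variables is not described here, so this example simply assumes shell-style $NAME substitution; the helper name resolve_path and the two dictionaries standing in for each server's environment are hypothetical.

```python
from string import Template


def resolve_path(path, env):
    """Substitute $NAME references in a path using the given environment."""
    # Template.substitute raises KeyError if a variable is not defined.
    return Template(path).substitute(env)


# Stand-ins for the environment on each server (values from the text).
server_1_env = {"DPATH": "/biodata/me"}
server_2_env = {"DPATH": "/externaldata/staff"}

on_server_1 = resolve_path("$DPATH/worm", server_1_env)
on_server_2 = resolve_path("$DPATH/worm", server_2_env)
print(on_server_1)
print(on_server_2)
```

The same description line therefore resolves to a different output folder on each server, which is what makes a single shared description file possible.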