SimulatedReads
Overview
This workflow simulates haplotypes of one or more (defined by number_of_families) bi-parental outcrossing populations using the PedigreeSim software, based on a provided linkage map and SNP markers. It then uses the RADinitio software, the simulated haplotypes, and a reference genome to also simulate genotyping-by-sequencing reads. Afterwards, it performs SNP and genotype calling and builds 68 linkage maps from the combinations of:
- SNP calling: GATK and Freebayes
- Dosage/genotype calling: updog, polyRAD and SuperMASSA
- Linkage map building software: OneMap 3.0 and GUSMap
- Genotype probabilities from GATK, Freebayes, updog, polyRAD and SuperMASSA, and a global error rate of 5% and 0.001% in the OneMap HMM
It also has the options to:
- Include or exclude multiallelic (MNP) markers
- Apply filters using bcftools
This workflow uses:
- A reference linkage map
- A reference VCF file
- A single chromosome from a reference genome
- A diploid bi-parental F1 population
- Genomic positions for marker order
Workflow
SimulatedReads
Subworkflows
- SimulatedSingleFamily
- create_alignment_from_read_simulations
- genotyping_simulated
- gusmap_maps_simulated
- snpcaller_maps_simulated
- freebayes_genotyping
- gatk_genotyping
Input files
First, go to our releases
page and download the assets of the most recent version of the
SimulatedReads
workflow.
You will need the following input information:
number_of_families: an integer defining the number of families with popsize individuals to be simulated
global_seed: seed used to generate the family seeds
max_cores: maximum number of computer cores to be used
filters: filters to be applied by bcftools to the VCF file after SNP calling
chunk_size: how many samples are to be evaluated by GATK in a single node
replaceAD: whether the allele depth (AD) in the VCFs generated by GATK and Freebayes should be replaced by the allele depth from the alignment files (BAM)
hardfilters: true to perform hard filtering on the GATK results (see the GATK documentation on hard filtering)
gatk_mchap: true if MCHap is used on the GATK results
n_chrom: the species' number of chromosomes (used to parallelize Freebayes across nodes)
family:
seed: seed used to reproduce the analysis later (warning: some steps, such as the read simulation, are still random)
popsize: number of individuals in the progeny population
ploidy: the ploidy of the species; for now, only diploid (2) species are supported
cross: cross type; for now, only the “F1” option is available
doses: if you do not have a VCF file with variants to be simulated, you can define here the percentage of markers with doses 0, 1, and 2 (when the cross is F1)
cmBymb: if you do not have a reference linkage map, you can simulate using a general recombination rate according to other genetic maps of the species
sequencing:
library_type: the options RADseq, WGS, and Exome are available
multiallelics: define with true or false whether the analysis should try to include multiallelic markers in the linkage maps
emp_vcf: reference VCF file with the variants to be simulated
emp_bam: reference BAM file; it will be used to define the read profile in the WGS and Exome simulations
ref_map: reference linkage map; a text file with two columns, one named “cM” with the centimorgan position of each marker and the other named “bp” with the respective base pair position (see the example format after this list). The markers in your reference map do not need to be the same as those in the VCF file. Using splines, this map is used to train a model that defines the centimorgan positions of the simulated variants in the genome
enzyme1: if RADseq, the enzyme used to reduce the genome representation
enzyme2: if RADseq, the second enzyme used to reduce the genome representation
vcf_parent1: parent 1 ID in the reference VCF
vcf_parent2: parent 2 ID in the reference VCF
chromosome: chromosome ID to be simulated
pcr_cycles: if RADseq, the number of PCR cycles used in the library preparation (default: 9)
insert_size: if RADseq, the insert size in bp (default: 350)
read_length: if RADseq, the read length in bp (default: 150)
depth: sequencing depth (default: 20)
depth_parents: sequencing depth for the parent samples
insert_size_dev: if RADseq, the insert size standard deviation in bp (default: 35)
mapsize: map size in centimorgan of the chromosome subset to be run
rm_dupli: whether the workflow should (true) or should not (false) remove duplicated sequences from the alignment file before SNP calling
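For reference, here is a minimal, hypothetical example of the ref_map layout described above; the marker positions are illustrative only, and the two columns are shown whitespace-separated:
cM	bp
0.0	12456
1.8	205637
3.5	486213
5.1	902154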
references:
ref_fasta: chromosome sequence in FASTA format (only one chromosome at a time, and no Ns are allowed)
ref_fasta_index: index made by samtools faidx
ref_dict: sequence dictionary made by Picard CreateSequenceDictionary
ref_sa: index made by bwa index
ref_amb: index made by bwa index
ref_bwt: index made by bwa index
ref_ann: index made by bwa index
ref_pac: index made by bwa index
You can use the following containers to create these indexes. Example:
docker run -v $(pwd):/data/ us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z samtools faidx tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa
docker run -v $(pwd):/data/ us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z /usr/gitc/./bwa index tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa
docker run -v $(pwd):/data/ us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z java -jar /usr/gitc/picard.jar CreateSequenceDictionary R=tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa O=tests/data/PtrichocarpaV3.0/Chr10.11.2M.dict
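In these commands, samtools faidx produces the .fai index (ref_fasta_index), bwa index produces the .amb, .ann, .bwt, .pac, and .sa files (ref_amb, ref_ann, ref_bwt, ref_pac, and ref_sa), and Picard CreateSequenceDictionary produces the .dict file (ref_dict).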
Example of how to run
Once you have all the input files, there are several options for running the workflow. See more about them on the openwdl page.
One example is using the cromwell engine. Cromwell offers different backend configurations that allow running the workflows under different management systems, such as SLURM and SGE. It is also possible to set configurations to use Singularity containers and a MySQL database to keep a cache of runs. Check some of these configuration options in our configurations files. You can find further information about all available options in its documentation. To run the workflow through cromwell, you will first need an inputs file. Use womtool to generate the inputs template:
java -jar womtool.jar inputs SimulatedReads_v1.0.0.wdl > SimulatedReads.inputs.json
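The generated template lists every required key, prefixed with the workflow name. As a rough sketch only (the values below are hypothetical, and the authoritative key names and nesting come from the womtool template), a filled file might begin like this:
{
  "SimulatedReads.number_of_families": 5,
  "SimulatedReads.global_seed": 8080,
  "SimulatedReads.max_cores": 4,
  "SimulatedReads.chunk_size": 5
}
The family, sequencing, and references inputs described above are filled in the same file.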
Adapt the input file paths and run Cromwell:
Execute the workflow
java -jar /path/to/cromwell.jar run -p SimulatedReads_v1.0.0.zip -i SimulatedReads.inputs.json SimulatedReads_v1.0.0.wdl
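Here, -i points to the filled inputs JSON and -p points to the release .zip that bundles the subworkflow and task files, so Cromwell can resolve the workflow imports.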
Warning: This analysis demands high computational capacity to run. Check some examples of configurations in the .configurations directory. Here are two examples of how to use them:
- For storing the metadata and cache in a MySQL database (see also cromwell-cli for easy access to metadata):
# Start the MySQL container
docker run -d -v banco_cromwell:/var/lib/mysql --rm --name mysql-cromwell -p 3307:3306 -e MYSQL_ROOT_PASSWORD=1234 -e MYSQL_DATABASE=cromwell mysql:5.7
# Execute the workflow
java -Dconfig.file=.configurations/cromwell_mysql.conf -jar cromwell.jar run -p SimulatedReads_v1.0.0.zip -i SimulatedReads.inputs.json SimulatedReads_v1.0.0.wdl
# Or execute the workflow through the server interface
java -Dconfig.file=.configurations/cromwell_mysql.conf -jar cromwell.jar server
If you use the server interface, you must open the indicated local address in your browser and submit the inputs file (.json), the workflow (.wdl), and the structs and tasks directories compressed as a .zip.
- If you are using a High-Performance Computing (HPC) cluster with the SLURM management system, create a batch file named slurm_main.sh:
#!/bin/bash
#SBATCH --export=NONE
#SBATCH -J cromwell_Reads2Map
#SBATCH --nodes=1
#SBATCH --mem=1G
#SBATCH --time=01:30:00
#SBATCH -o /home/user/Reads2Map.log
#SBATCH -e /home/user/Reads2Map.err
# It may be necessary to load the java and singularity modules here. Check the specifications of your HPC.
#module load singularity
#module load java
java -Dconfig.file=/home/user/Reads2Map/.configurations/cromwell_no_mysql.conf \
-jar /home/user/Reads2Map/cromwell-65.jar \
run /home/user/Reads2Map/SimulatedReads_v1.0.0.wdl \
-p SimulatedReads_v1.0.0.zip \
-i /home/user/Reads2Map/inputs/SimulatedReads.inputs.json
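Then submit the job to the scheduler:
sbatch slurm_main.sh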
When the run ends, the log printed on the screen will point to the path of the workflow output files.
Generating the workflow graphs
The figures in this tutorial were generated using womtool and graphviz. Example:
java -jar womtool-84.jar graph SimulatedReads.wdl > SimulatedReads.gv
dot -Tsvg SimulatedReads.gv > SimulatedReads.svg
Visualize Reads2Map workflows output in Reads2MapApp
You can find all of the workflow's intermediary files in the cromwell-executions directory generated by Cromwell. The log file specifies the workflow ID and the path of each executed task. The final outputs of EmpiricalMaps.wdl and SimulatedReads.wdl are compressed files called EmpiricalReads_results.tar.gz and SimulatedReads_results.tar.gz. These files contain tables that give an overview of the entire procedure. They are the inputs for Reads2MapApp, a Shiny app that provides a graphical view of the results. Check the Reads2MapApp repository for further information.