Building Linkage Maps from Simulated Reads

2022-12-01

SimulatedReads

Overview

This workflow simulates haplotypes for one or more (defined by number_of_families) bi-parental outcrossing populations using the PedigreeSim software, based on a provided linkage map and SNP markers. It then uses the RADinitio software, the simulated haplotypes, and a reference genome to simulate genotyping-by-sequencing reads. Afterwards, it performs SNP and genotype calling and builds 68 linkage maps from the combinations of:

  • SNP calling: GATK and Freebayes
  • Dosage/genotype calling: updog, polyRAD and SuperMASSA
  • Linkage map build software: OneMap 3.0 and GUSMap
  • Error probability in the OneMap HMM: genotype probabilities from GATK, Freebayes, updog, polyRAD and SuperMASSA, and global error rates of 5% and 0.001%

It also provides options to:

  • Include or exclude multiallelic (MNP) markers
  • Apply filters using bcftools

This workflow uses:

  • A reference linkage map
  • A reference VCF file
  • A single chromosome from a reference genome
  • A diploid bi-parental F1 population
  • Genomic positions for marker order

Workflow

SimulatedReads



Subworkflows

SimulatedSingleFamily



create_alignment_from_read_simulations



genotyping_simulated



gusmap_maps_simulated



snpcaller_maps_simulated



freebayes_genotyping



gatk_genotyping



Input files

First, go to our releases page and download the assets of the most recent version of the SimulatedReads workflow.
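
For example, you can fetch the assets from the command line with wget; the repository organization and release tag below are placeholders, so adjust them to the release you want to use:

# Hypothetical URLs: replace <org> and the version tag with the actual release
wget https://github.com/<org>/Reads2Map/releases/download/SimulatedReads_v1.0.0/SimulatedReads_v1.0.0.wdl
wget https://github.com/<org>/Reads2Map/releases/download/SimulatedReads_v1.0.0/SimulatedReads_v1.0.0.zip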

You will need the following input information:

number_of_families: an integer defining the number of families, each with popsize individuals, to be simulated

global_seed: seed used to generate the seeds of the individual families

max_cores: maximum number of computer cores to be used

filters: filters to be applied by bcftools to the VCF file after SNP calling

chunk_size: how many samples are to be evaluated by GATK in a single node

replaceAD: whether the allele depth (AD) in the VCFs generated by GATK and Freebayes should be replaced by the allele depth from the alignment (BAM) files

hardfilters: true to perform hard filtering on the GATK results (see the GATK documentation on hard filtering for details)

gatk_mchap: true if MCHap should be applied to the GATK results

n_chrom: number of chromosomes of the species (used to parallelize Freebayes across nodes)

family:

  • seed: seed to reproduce the analysis later (warning: some steps, such as the read simulation, are still random)

  • popsize: number of individuals in the progeny population

  • ploidy: the ploidy of the species; for now, only diploid (2) species are supported

  • cross: cross type. For now, only the “F1” option is available

  • doses: if you do not have a VCF file with variants to be simulated, you can define here the percentage of markers with doses 0, 1, and 2 (when the cross is F1)

  • cmBymb: if you do not have a reference linkage map, you can simulate using a general recombination rate (cM/Mb) based on other genetic maps of the species

sequencing:

  • library_type: the options RADseq, WGS, and Exome are available

  • multiallelics: define with true or false whether the analysis should try to include multiallelic markers in the linkage maps

  • emp_vcf: reference VCF file with the variants to be simulated

  • emp_bam: reference BAM file. It will be used to define the read profile in the WGS and Exome simulations

  • ref_map: reference linkage map; a text file with two columns, one named “cM” with the centimorgan position of each marker and the other named “bp” with the respective base-pair position. The markers in your reference map do not need to be the same as those in the VCF file. Using splines, this map is used to train a model that defines the centimorgan positions of the simulated variants in the genome

  • enzyme1: If RADseq, the enzyme used to reduce the genome representation

  • enzyme2: If RADseq, the second enzyme used to reduce the genome representation

  • vcf_parent1: parent 1 ID in the reference VCF

  • vcf_parent2: parent 2 ID in the reference VCF

  • chromosome: chromosome ID to be simulated

  • pcr_cycles: If RADseq, the number of PCR cycles used in the library preparation (default: 9)

  • insert_size: If RADseq, define the insert size in bp (default: 350)

  • read_length: If RADseq, define the read length in bp (default: 150)

  • depth: sequencing depth (default: 20)

  • depth_parents: sequencing depth in the parent samples

  • insert_size_dev: If RADseq, define the insert size standard deviation in bp (default: 35)

  • mapsize: map size in centimorgans of the chromosome subset to be run

  • rm_dupli: whether the workflow should (true) or should not (false) remove duplicated sequences from the alignment file before the SNP calling analysis

references:

  • ref_fasta: chromosome sequence in FASTA format (only one chromosome at a time, and no Ns are allowed)

  • ref_fasta_index: index made by samtools faidx

  • ref_dict: index made by picard dict

  • ref_sa: index made by bwa index

  • ref_amb: index made by bwa index

  • ref_bwt: index made by bwa index

  • ref_ann: index made by bwa index

  • ref_pac: index made by bwa index

You can use the following containers to create these indexes. Example:

docker run -v $(pwd):/data/ us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z samtools faidx tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa
docker run -v $(pwd):/data/ us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z /usr/gitc/./bwa index tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa
docker run -v $(pwd):/data/ us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z java -jar /usr/gitc/picard.jar CreateSequenceDictionary R=tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa O=tests/data/PtrichocarpaV3.0/Chr10.11.2M.dict
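
If these commands finish successfully, the indexes should appear next to the FASTA file. For this example chromosome you would expect the following files, which map to the inputs listed above:

ls tests/data/PtrichocarpaV3.0/
# Chr10.11.2M.fa       chromosome sequence                   -> ref_fasta
# Chr10.11.2M.fa.fai   from samtools faidx                   -> ref_fasta_index
# Chr10.11.2M.dict     from picard CreateSequenceDictionary  -> ref_dict
# Chr10.11.2M.fa.amb   from bwa index                        -> ref_amb
# Chr10.11.2M.fa.ann   from bwa index                        -> ref_ann
# Chr10.11.2M.fa.bwt   from bwa index                        -> ref_bwt
# Chr10.11.2M.fa.pac   from bwa index                        -> ref_pac
# Chr10.11.2M.fa.sa    from bwa index                        -> ref_sa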

Example of how to run

Once you have all the input files, there are several options for running the workflow. See more about them on the openwdl page.

One example is using the Cromwell engine. Cromwell offers different backend configurations that allow running the workflows in different management systems, such as SLURM and SGE. It is also possible to set configurations to use Singularity containers and a MySQL database to keep a cache of runs. Check some of these configuration options in our configuration files. You can see further information about all available options in the Cromwell documentation. To run the workflow through Cromwell, you will first need an input file. Use womtool to generate the inputs template:

java -jar womtool.jar inputs SimulatedReads_v1.0.0.wdl > SimulatedReads.inputs.json
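
As an illustration only, a filled fragment of the template might look like the sketch below. The authoritative key names and nesting come from the womtool-generated template (the prefix corresponds to the workflow name inside the WDL), and all values shown here are hypothetical; the sequencing and references groups are filled in the same way using the fields described above.

{
  "SimulatedReads.number_of_families": 5,
  "SimulatedReads.global_seed": 8080,
  "SimulatedReads.max_cores": 4,
  "SimulatedReads.family": {
    "seed": 1234,
    "popsize": 200,
    "ploidy": 2,
    "cross": "F1",
    "cmBymb": 4.0
  }
}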

Adapt the input file paths and run Cromwell:

# Execute the workflow
java -jar /path/to/cromwell.jar run -p SimulatedReads_v1.0.0.zip -i SimulatedReads.inputs.json SimulatedReads_v1.0.0.wdl

Warning: This analysis demands high computational capacity to run. Check some example configurations in the .configurations directory. Here are two examples of how to use them:

  • For storing the metadata and cache in a MySQL database (see also cromwell-cli for easy access to metadata):
# Start the MySQL container
docker run -d -v banco_cromwell:/var/lib/mysql --rm --name mysql-cromwell -p 3307:3306 -e MYSQL_ROOT_PASSWORD=1234 -e MYSQL_DATABASE=cromwell mysql:5.7

# Execute the workflow
java -Dconfig.file=.configurations/cromwell_mysql.conf -jar cromwell.jar run -p SimulatedReads_v1.0.0.zip -i SimulatedReads.inputs.json SimulatedReads_v1.0.0.wdl

# Or execute the workflow through the server interface
java -Dconfig.file=.configurations/cromwell_mysql.conf -jar cromwell.jar server

If you use the server interface, open the indicated local address in your browser and submit the input file (.json), the workflow (.wdl), and the compressed struct and task directories (.zip).
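
As an alternative to the browser interface, the same files can be submitted to the server's REST API. A minimal sketch, assuming the server is listening on Cromwell's default port (8000):

# Submit the workflow, its inputs, and the compressed dependencies to the running server
curl -X POST "http://localhost:8000/api/workflows/v1" \
     -F workflowSource=@SimulatedReads_v1.0.0.wdl \
     -F workflowInputs=@SimulatedReads.inputs.json \
     -F workflowDependencies=@SimulatedReads_v1.0.0.zip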

  • If you are using a High-Performance Computing (HPC) with slurm management system:

Create a batch file named slurm_main.sh:

#!/bin/bash

#SBATCH --export=NONE
#SBATCH -J cromwell_Reads2Map
#SBATCH --nodes=1                    
#SBATCH --mem=1G
#SBATCH --time=01:30:00
#SBATCH -o /home/user/Reads2Map.log
#SBATCH -e /home/user/Reads2Map.err

# You may need to load the java and singularity modules here. Check the specifications of the HPC.
#module load singularity
#module load java

java -Dconfig.file=/home/user/Reads2Map/.configurations/cromwell_no_mysql.conf \
     -jar /home/user/Reads2Map/cromwell-65.jar \
     run /home/user/Reads2Map/SimulatedReads_v1.0.0.wdl \
     -p SimulatedReads_v1.0.0.zip \
     -i /home/user/Reads2Map/inputs/SimulatedReads.inputs.json 
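
Then submit the batch file and follow the run; the log path comes from the #SBATCH directives above:

sbatch slurm_main.sh
squeue -u $USER                    # check whether the job is still queued or running
tail -f /home/user/Reads2Map.log   # follow Cromwell's progress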

When the run ends, the log printed on the screen will point to the path of the workflow output files.

Generating the workflow graphs

The figures in this tutorial were generated using womtool and graphviz. Example:

java -jar womtool-84.jar graph SimulatedReads.wdl > SimulatedReads.gv   
dot -Tsvg SimulatedReads.gv > SimulatedReads.svg

Visualize Reads2Map workflow outputs in Reads2MapApp

You can find all of the workflow's intermediate files in the cromwell-executions directory generated by Cromwell. The log file specifies the workflow id and the path of each executed task. The final outputs of EmpiricalMaps.wdl and SimulatedReads.wdl are compressed files called EmpiricalReads_results.tar.gz and SimulatedReads_results.tar.gz. These files contain tables that give an overview of the entire procedure. They are the inputs for Reads2MapApp, a Shiny app that provides a graphical view of the results. Check the Reads2MapApp repository for further information.
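
A minimal sketch for locating and unpacking the results after a run (the exact path depends on the workflow id assigned by Cromwell):

# Find the compressed results inside the Cromwell execution directory
results=$(find cromwell-executions -name "SimulatedReads_results.tar.gz" | head -n 1)

# Extract the tables; the extracted files can then be loaded into Reads2MapApp
tar -xzvf "$results"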