Building Linkage Maps from Empirical Reads

2023-07-13

EmpiricalSNPCalling

Overview

The EmpiricalSNPCalling workflow aligns FASTQ files to a reference genome and performs SNP calling with GATK tools (HaplotypeCaller, GenomicsDBImport, and GenotypeGVCFs), TASSEL, Stacks, and Freebayes. The samples are split into chunks that run on different nodes to optimize the analyses. Set the number of samples per chunk with the chunk_size input and the number of cores used in each node with max_cores.

The workflow also includes options to:

  • Remove or keep read duplicates
  • Perform hard filtering on GATK results
  • Replace the VCF AD format field with allele counts from BAM files
  • Run the MCHap software to build haplotypes based on GATK-called markers

This workflow requires:

  • Diploid or polyploid species
  • Single-end or paired-end reads

Workflow

EmpiricalSNPCalling



Subworkflows

create_alignment_from_families_files



gatk_genotyping



freebayes_genotyping



tassel_genotyping



stacks_genotyping



Input files

EmpiricalSNPCalling requires demultiplexed and cleaned FASTQ files. We provide a suggested workflow for preprocessing reads in PreprocessingReads.wdl.

EmpiricalSNPCalling receives the following inputs:

samples_info:

For single-end reads: a TSV file with the path to the FASTQ file in the first column, the sample name in the second column, and the sample name with lane specification in the third column. Example:

tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249787.sub.fastq   I_3_55  I_3_55.Lib1_C09_AGAAGTC
tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249788.sub.fastq   I_3_66  I_3_66.Lib1_D06_GCCAACT
tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249795.sub.fastq   PT_F    PT_F.Lib1_E09_TGAACAT
tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249808.sub.fastq   PT_M    PT_M.Lib2_E06_CGATGCG

For paired-end reads: a TSV file with the path to the forward-read FASTQ file (R1) in the first column, the path to the reverse-read FASTQ file (R2) in the second column, the sample name in the third column, and the sample name with lane specification in the fourth column. Example:

tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249787.R1.sub.fastq    tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249787.R2.sub.fastq   I_3_55  I_3_55.Lib1_C09_AGAAGTC
tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249787.R1.sub.fastq    tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249787.R2.sub.fastq   I_3_66  I_3_66.Lib1_D06_GCCAACT
tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249787.R1.sub.fastq    tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249787.R2.sub.fastq   PT_F    PT_F.Lib1_E09_TGAACAT
tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249787.R1.sub.fastq    tests/data/Ptremula_PRJNA395596_subset/fastq/SRR6249787.R2.sub.fastq   PT_M    PT_M.Lib2_E06_CGATGCG
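
If your FASTQ files follow a consistent naming scheme, the samples_info file can be generated with a small shell loop. The sketch below covers the paired-end case and assumes files named <sample>.R1.fastq and <sample>.R2.fastq under a fastq/ directory; the “.Lib1” suffix is a placeholder for your real library/lane specification.

# Sketch only: assumes fastq/<sample>.R1.fastq and fastq/<sample>.R2.fastq;
# replace the ".Lib1" placeholder with your library/lane information.
for r1 in fastq/*.R1.fastq; do
    r2=${r1%.R1.fastq}.R2.fastq
    sample=$(basename "$r1" .R1.fastq)
    printf '%s\t%s\t%s\t%s\n' "$r1" "$r2" "$sample" "${sample}.Lib1"
done > samples_info.tsv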

pair_end: “true” if the data set contains paired-end reads

ploidy: species ploidy

n_chrom: number of chromosomes of the species (used to parallelize freebayes across nodes)

chunk_size: number of samples to be evaluated by GATK in the same node

max_cores: maximum number of cores to be used by alignment and freebayes tasks

max_ram: maximum RAM allowed for GATK and TASSEL tasks (in megabytes)

empirical.references

  • ref_fasta: chromosome sequence in fasta format (only one chromosome at a time)

  • ref_fasta_index: index made by samtools faidx

  • ref_dict: sequence dictionary made by Picard CreateSequenceDictionary

  • ref_sa: index made by bwa index

  • ref_amb: index made by bwa index

  • ref_bwt: index made by bwa index

  • ref_ann: index made by bwa index

  • ref_pac: index made by bwa index

You can use the following containers to create these indexes: samtools faidx produces the .fai index, bwa index produces the .amb, .ann, .bwt, .pac, and .sa files, and Picard CreateSequenceDictionary produces the .dict file. Example:

docker run -v $(pwd):/data/ us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z samtools faidx tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa
docker run -v $(pwd):/data/ us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z /usr/gitc/bwa index tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa
docker run -v $(pwd):/data/ us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z java -jar /usr/gitc/picard.jar CreateSequenceDictionary R=tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa O=tests/data/PtrichocarpaV3.0/Chr10.11.2M.dict

rm_dupli: whether the workflow should (“true”) or should not (“false”) remove duplicated sequences from the alignment files before the SNP calling analysis

replaceAD: whether the allele depth (AD) field in the VCFs generated by GATK and freebayes should be replaced by allele depths computed from the alignment files (BAM)

run_gatk: “true” for running GATK

hardfilters: “true” for performing hard filtering on GATK results (see the GATK documentation on hard filtering for details)

gatk_mchap: “true” for running MCHap in GATK results (warning: still under development)

run_freebayes: “true” for running freebayes

run_tassel: “true” for running TASSEL

run_stacks: “true” for running Stacks

enzyme: enzyme name (required only if running TASSEL)

pop_map: population map file (used when running Stacks; if not specified, a single population is considered)

P1 and P2: used to filter markers by segregation if running MCHap

The inputs GatkGenotyping.vcf_simu, GatkGenotyping.depth, FreebayesGenotyping.vcf_simu, and GatkGenotyping.seed can be ignored when running with empirical reads; they are used only for simulations.
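
For reference, the block below sketches a Cromwell inputs file covering the main options described above. The “EmpiricalSNPCalling.” key prefix, the nesting of the references struct, the value types, and the values themselves are assumptions for illustration only; generate the authoritative template from the WDL (for example, java -jar womtool.jar inputs EmpiricalSNPCalling.wdl) and adjust the names and types accordingly.

# Hypothetical inputs sketch; key prefix, value types, and values are assumptions.
# Generate the real template with: java -jar womtool.jar inputs EmpiricalSNPCalling.wdl
cat > EmpiricalSNPCalling.inputs.json <<'EOF'
{
  "EmpiricalSNPCalling.samples_info": "samples_info.tsv",
  "EmpiricalSNPCalling.pair_end": "true",
  "EmpiricalSNPCalling.ploidy": "2",
  "EmpiricalSNPCalling.n_chrom": "19",
  "EmpiricalSNPCalling.chunk_size": "5",
  "EmpiricalSNPCalling.max_cores": "4",
  "EmpiricalSNPCalling.max_ram": "8000",
  "EmpiricalSNPCalling.rm_dupli": "false",
  "EmpiricalSNPCalling.replaceAD": "true",
  "EmpiricalSNPCalling.run_gatk": "true",
  "EmpiricalSNPCalling.hardfilters": "true",
  "EmpiricalSNPCalling.gatk_mchap": "false",
  "EmpiricalSNPCalling.run_freebayes": "false",
  "EmpiricalSNPCalling.run_tassel": "false",
  "EmpiricalSNPCalling.run_stacks": "false",
  "EmpiricalSNPCalling.references": {
    "ref_fasta": "tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa",
    "ref_fasta_index": "tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa.fai",
    "ref_dict": "tests/data/PtrichocarpaV3.0/Chr10.11.2M.dict",
    "ref_sa": "tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa.sa",
    "ref_amb": "tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa.amb",
    "ref_bwt": "tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa.bwt",
    "ref_ann": "tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa.ann",
    "ref_pac": "tests/data/PtrichocarpaV3.0/Chr10.11.2M.fa.pac"
  }
}
EOF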

The files output by EmpiricalSNPCalling.wdl are inputs for EmpiricalMaps.wdl. However, we suggest checking the VCF marker quality parameters and applying proper filters before proceeding to EmpiricalMaps.wdl. EmpiricalMaps.wdl is limited to running only one chromosome, so make sure you filter the VCF to retain only the selected one.
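
For example, a single-chromosome, quality-filtered VCF could be produced with bcftools as sketched below. The chromosome name (Chr10), the QUAL cutoff, and the file names are placeholders; the VCF must be bgzipped and indexed for region queries.

# Sketch only: "Chr10", the QUAL cutoff, and the file names are placeholders.
bgzip -c gatk_output.vcf > gatk_output.vcf.gz
tabix -p vcf gatk_output.vcf.gz
bcftools view -r Chr10 -i 'QUAL>=30' -O z -o gatk_Chr10_filtered.vcf.gz gatk_output.vcf.gz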

EmpiricalMaps

Overview

The EmpiricalMaps workflow receives as input the VCF files from the EmpiricalSNPCalling workflow and builds linkage maps for a single chromosome, running the possible combinations of:

  • SNP calling: GATK, TASSEL, Stacks, and Freebayes
  • Dosage/genotype calling: updog, polyRAD, and SuperMASSA
  • Linkage map building software: OneMap 3.0, MAPpoly, and GUSMap
  • Genotype probabilities from GATK, Freebayes, TASSEL, Stacks, updog, polyRAD, and SuperMASSA, and global error rates of 0.05 and 0.001 in the OneMap or MAPpoly HMM

It also has options to:

  • Include or exclude multiallelic (MNP) markers
  • Apply filters using bcftools

This workflow uses:

  • Diploid bi-parental F1 population
  • Genomic positions for marker ordering

Subworkflows

genotyping_empirical



snpcaller_maps_empirical



gusmap_maps_empirical



mappoly_maps_empirical



Input files

vcfs: array with VCF files. You can include one or more VCFs. It should be written in the format: “Maps.vcfs”: [“path/to/vcf1.vcf”, “path/to/vcf2.vcf”, “path/to/vcf3.vcf”]

vcfs_software: array with the SNP calling software used to generate each of the VCFs specified in the vcfs input. Format example: “Maps.vcfs_software”: [“GATK”, “GATK”, “freebayes”]

vcfs_counts_source: array with the counts source of each of the VCFs in the vcfs input. This feature is related to the replaceAD option described in the EmpiricalSNPCalling section. Format example: “Maps.vcfs_counts_source”: [“VCF”, “BAM”, “VCF”]

dataset:

  • parent1: parent 1 ID
  • parent2: parent 2 ID
  • name: experiment ID
  • chromosome: chromosome being evaluated (only one allowed)
  • cross: cross type (currently, only F1 is available)
  • multiallelics: whether or not to consider multiallelic markers

max_cores: maximum number of CPUs to be used in parallelized processes

gatk_mchap: “true” if MCHap was used in GATK results

gatk_vcf_multi: VCF file resulting from MCHap (required only if gatk_mchap is “true”)

filters: specify filters to be applied by bcftools

filter_segr: “true” to filter markers by segregation distortion

filter_noninfo: filter out markers that are not informative (aa x bb, aa x NA, or vice-versa in outcrossing populations) before running the genotype/dosage calling

prob_thres: define threshold for filtering genotypes by their genotype probabilities

ploidy: species ploidy. If 2, the workflow runs OneMap and GUSMap; if higher than 2, it runs MAPpoly.

replaceADbyMissing: if “true”, the allele depth (AD) field in the input VCF is replaced by missing (0,0) whenever the genotype (GT) is missing (./.)

run_supermassa: if “true”, run SuperMASSA genotype caller

run_polyrad: if “true”, run polyRAD genotype caller

run_updog: if “true”, run updog genotype caller

run_gusmap: if “true”, run GUSMap genotype caller and linkage map builder

genoprob_error: if “true”, the genotype probabilities from the genotype calling software are applied in the OneMap HMM approach to estimate the linkage map genetic distances

global_errors: array with global error values to be applied in the OneMap HMM approach to estimate the linkage map genetic distances. Example: [“0.05”, “0.001”].

genoprob_global_errors: array with global error values to be combined with the genotype calling software’s probabilities and applied in the OneMap HMM approach to estimate the linkage map genetic distances. Example: [“0.05”, “0.001”].
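
Putting these inputs together, the block below sketches a Cromwell inputs file for EmpiricalMaps. The “Maps.” prefix follows the format examples above, but the exact key names, value types, and the values themselves are illustrative assumptions; generate the authoritative template from EmpiricalMaps.wdl (for example with womtool) and adjust accordingly.

# Hypothetical inputs sketch for EmpiricalMaps; key names, value types, and values
# are illustrative assumptions. Generate the real template with:
#   java -jar womtool.jar inputs EmpiricalMaps.wdl
cat > EmpiricalMaps.inputs.json <<'EOF'
{
  "Maps.vcfs": ["gatk_Chr10_filtered.vcf.gz", "freebayes_Chr10_filtered.vcf.gz"],
  "Maps.vcfs_software": ["GATK", "freebayes"],
  "Maps.vcfs_counts_source": ["VCF", "VCF"],
  "Maps.dataset": {
    "parent1": "PT_F",
    "parent2": "PT_M",
    "name": "populus_example",
    "chromosome": "Chr10",
    "cross": "F1",
    "multiallelics": "false"
  },
  "Maps.max_cores": "4",
  "Maps.ploidy": "2",
  "Maps.gatk_mchap": "false",
  "Maps.filter_segr": "true",
  "Maps.filter_noninfo": "true",
  "Maps.prob_thres": "0.8",
  "Maps.replaceADbyMissing": "true",
  "Maps.run_supermassa": "true",
  "Maps.run_polyrad": "true",
  "Maps.run_updog": "true",
  "Maps.run_gusmap": "true",
  "Maps.genoprob_error": "true",
  "Maps.global_errors": ["0.05", "0.001"],
  "Maps.genoprob_global_errors": ["0.05", "0.001"]
}
EOF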