Before we start, it is important to highlight that all the software presented here has several parameters that can be adjusted for each scenario. We will use the default settings, but to get the best out of each tool, it is important to explore its specificities. So… use them carefully!

What you will need to run this tutorial

  • Linux system
  • At least 4 GB of RAM
  • At least 7 GB of free disk space
  • Docker
  • Internet connection

Docker

This tutorial is based on Docker containers, so you will save the time of installing programs and can instead spend it exploring the meaning of their parameters and the biological concepts. See this tutorial for Docker installation. As the page shows, Docker can be installed on Mac, Windows, or Linux; this tutorial assumes the Linux version.
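
To check that the installation works before moving on, the classic smoke test is:

docker run hello-world   # downloads a tiny test image and prints a confirmation message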

After installing, here are some useful commands:

docker pull                               # get an image from a registry
docker run                                # run a command in a new container
docker run -it -v                         # run in interactive mode, mounting a directory into the container
docker images                             # list available images
docker ps                                 # list containers
docker stop <container_id>                # stop a container
docker rm  <container_id>                 # remove a container
docker rmi <image_id>                     # remove an image
docker rmi $(docker images -q)            # remove all images
docker stop $(docker ps -a -q)            # stop all running containers
docker rm $(docker ps -a -q)              # remove all stopped containers
docker build -t <image_name> <dir>        # build an image from the Dockerfile in <dir>
docker push                               # send a built image to a registry (e.g. Docker Hub)

To save time during the tutorial, pull now all the needed images:

docker pull cristaniguti/sratoolkit:latest
docker pull biocontainers/fastqc:v0.11.9_cv7
docker pull ewels/multiqc:latest
docker pull taniguti/stacks:latest
docker pull kfdrc/bwa-picard:latest-dev
docker pull kfdrc/samtools:latest
docker pull taniguti/freebayes:latest
docker pull taniguti/gatk-picard:latest

Run these commands and go grab a coffee; they can take some time. After you finish your analysis, don’t forget to remove all images and containers, otherwise they will keep occupying space on your computer.

Example data subset

To make this analysis feasible on personal computers, we will evaluate only three samples from this study. The authors crossed Catalpa bungei × Catalpa duclouxii and obtained phenotypic and genotypic data for the 200 \(F_1\) progenies. They performed whole-genome sequencing of the parents 16-PJ-3 and 7080 and RADseq of the 200 progenies. The sequences can be downloaded from the SRA database under accession number PRJNA551333. The enzyme EcoRI was used to reduce the representation of the genome. These species are diploid and, although the paper mentions a reference genome of Catalpa bungei, I couldn’t find it in any database (if someone finds it, please send me an e-mail: chtaniguti at usp.br). But no worries, let’s use a related one, Handroanthus impetiginosus.

The project PRJNA551333 is there, but how can we download it?

Let’s start with the magic of Docker (containers)! To download the sequences we will need the sratoolkit, more specifically its fasterq-dump tool. Fortunately, if you search Docker Hub, there is already an image of the toolkit there, ready to use.

Let’s suppose we didn’t find it. How could we build our own image? Check its Dockerfile:

FROM ubuntu:18.04

RUN apt update \
    && apt install -y  build-essential wget \
    && wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.2/sratoolkit.2.10.2-ubuntu64.tar.gz \
    && tar xzf sratoolkit.2.10.2-ubuntu64.tar.gz \
    && cd sratoolkit.2.10.2-ubuntu64/bin \
    && ./vdb-config -i

WORKDIR /sratoolkit.2.10.2-ubuntu64/bin

For more information about Dockerfiles, check here. Once we have defined everything we need in the image, we can just run the command:

docker build -t cristaniguti/sratoolkit . # run from the directory containing the Dockerfile

If you want to store the image on Docker Hub, you just need an account; then log in and push it.

docker login                        # enter your Docker Hub credentials
docker push cristaniguti/sratoolkit

Okay! Now, we can run our container based on this image.

To get the sequences we need their SRA run IDs. You can find them by searching NCBI for the project ID PRJNA551333. Click on the SRA database and then on “Send results to Run Selector”; there you get a list with all 202 individuals. Let’s download only SRR9599311, SRR9599313, and SRR9599314.

docker run -v $(pwd):/opt/ cristaniguti/sratoolkit ./fasterq-dump SRR9599314 -O /opt/

If you didn’t pull the image beforehand, Docker will pull it before running the command (but only the first time you use the image).

Although we have only three samples, I leave here a loop example in case you later want to download more samples and don’t want to write a separate line of code for each one.

echo "SRR9599311 SRR9599313 SRR9599314" > SRRs.txt

for i in $(cat SRRs.txt); do
    echo $i
    # This command will take a while, the files are big, ~1.2 GB each
    docker run -v $(pwd):/opt/ cristaniguti/sratoolkit ./fasterq-dump $i -O /opt/
done

We now have six files (three samples, paired-end), but they are still too big for the next steps, so we will take only a subset of each one and use only one read of each pair:

head -n 80000 SRR9599311_1.fastq > SRR9599311_1.sub.fastq # Just a subset of the reads
head -n 80000 SRR9599313_1.fastq > SRR9599313_1.sub.fastq # Just a subset of the reads
head -n 80000 SRR9599314_1.fastq > SRR9599314_1.sub.fastq # Just a subset of the reads
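
Since each FASTQ record takes four lines, 80,000 lines correspond to 20,000 reads per file. A quick sanity check:

wc -l *.sub.fastq # each file should report 80000 lines (= 20,000 reads)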

Pre-processing

A classic way to get an overview of our FASTQ sequences is the FastQC software:

for i in *sub.fastq; do # runs on every file in the directory ending with sub.fastq
    echo $i
    docker run -v $(pwd):/opt biocontainers/fastqc:v0.11.9_cv7 fastqc /opt/$i
done

It returns an HTML report with several precious pieces of information. Instead of checking a separate HTML file per sample, we can join them using the very fancy MultiQC tool.

docker run -v $(pwd):$(pwd)  -w $(pwd) ewels/multiqc . --title "Tutorial samples"

Results here.

This profile is typical of RADseq sequences; other library types would return other profiles and could have different problems. Check other examples:

  • RNA-seq (kindly provided by Fernando Henrique Correr, from BioProject PRJEB38368 (DOI: 10.1186/s12864-020-07091-y))

bad sequences - They were discarded

good sequences

bad sequences

To solve the problems, the author removed adapters, contaminant sequences (PhiX), duplicated sequences, and N bases, and also merged the overlapping paired-end reads.

good sequences

Back to our example dataset:

Now that we have checked our sequences, we can decide which filters to use to clean them. Here the samples are already demultiplexed; if yours are not, process_radtags from Stacks performs the demultiplexing along with some other quality filters, such as searching for the enzyme cut site and filtering by Phred quality score. Check its manual here. Since our samples do not present problems, we will use the process_radtags default parameters. If FastQC identifies adapters in your sequences, we suggest the cutadapt software (docker image).
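
As a minimal adapter-trimming sketch (the image name <cutadapt_image> is a placeholder; use the one linked above, and replace the adapter with the one FastQC reports):

# -a: adapter sequence to trim from the 3' end (AGATCGGAAGAGC is the standard Illumina prefix)
# -q: also trim bases below Phred quality 20 from the 3' end
docker run -v $(pwd):/opt <cutadapt_image> cutadapt -a AGATCGGAAGAGC -q 20 \
    -o /opt/SRR9599311_1.trimmed.fastq /opt/SRR9599311_1.sub.fastq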

mkdir raw
mv *sub.fastq raw
mkdir filtered

docker run -v $(pwd):/opt taniguti/stacks  process_radtags -p /opt/raw/ -o /opt/filtered/ \
                      -e ecoRI \
                      -r -c -q

The output:

No barcodes specified, files will not be demultiplexed.
Processing single-end data.
Using Phred+33 encoding for quality scores.
Found 3 input file(s).
Barcode type unspecified, assuming unbarcoded data.
Processing file 1 of 3 [SRR9599314_1.sub.fastq]
  Processing RAD-Tags...
  20000 total reads; -0 ambiguous barcodes; -19380 ambiguous RAD-Tags; +586 recovered; -0 low-quality reads; 620 retained reads.
Processing file 2 of 3 [SRR9599311_1.sub.fastq]
  Processing RAD-Tags...
  20000 total reads; -0 ambiguous barcodes; -19435 ambiguous RAD-Tags; +525 recovered; -0 low quality reads; 565 retained reads.
Processing file 3 of 3 [SRR9599313_1.sub.fastq]
  Processing RAD-Tags...
  20000 total reads; -0 ambiguous barcodes; -19343 ambiguous RAD-Tags; +620 recovered; -0 low quality reads; 657 retained reads.
Closing files, flushing buffers...
Outputing details to log: '/opt/filtered/process_radtags.raw.log'

60000 total sequences
    0 barcode not found drops (0.0%)
    0 low quality read drops (0.0%)
58158 RAD cutsite not found drops (96.9%)
 1842 retained reads (3.1%)

Something weird? It discarded 96.9% of our reads. Yes! That is because the authors submitted already-filtered sequences, with the enzyme cut site trimmed off. So we don’t need to filter again, but now we have learned how to do it if needed.

Alignment

There are many aligners! The most used is BWA-MEM, mainly because of the need to standardize studies. Something to think about: subtle batch effects can be caused by the usage of different tools in pre-processing. In humans: the Functional Equivalence Specification.

Did you already download the reference genome? Do it from here.

We will not make our personal computers suffer: we will use only a subset of the genome too, just the scaffold NKXS01000002.1:

grep -n ">" GCA_002762385.1_Himp0.1_genomic.fna > scaffolds.txt # line number of each scaffold header
head -n 10439 GCA_002762385.1_Himp0.1_genomic.fna | tail -n 5904 > subset_genome.fa # the lines spanning NKXS01000002.1
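
Alternatively, instead of counting lines by hand, samtools faidx can extract a scaffold by name (a sketch using the samtools image we pulled earlier; it creates the .fai index on the first call):

docker run -v $(pwd):/opt kfdrc/samtools:latest samtools faidx /opt/GCA_002762385.1_Himp0.1_genomic.fna NKXS01000002.1 > subset_genome.fa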

BWA requires some index files before the alignment, you can build them with:

docker run -v $(pwd):/opt/ kfdrc/bwa-picard:latest-dev bwa index /opt/subset_genome.fa
# Sample SRR9599311
docker run -v $(pwd):/opt/ kfdrc/bwa-picard:latest-dev bwa mem -t 2 \
    -R "@RG\tID:Offspring73.SRR9599311\tLB:lib-SRR9599311\tPL:illumina\tSM:Offspring73\tPU:FLOWCELL1.LANE1.SRR9599311" \
    /opt/subset_genome.fa /opt/raw/SRR9599311_1.sub.fastq > SRR9599311_1.sub.bam
    
docker run -v $(pwd):/opt kfdrc/bwa-picard:latest-dev java -jar /picard.jar SortSam \
    I="/opt/SRR9599311_1.sub.bam" \
    O="/opt/Offspring73.SRR9599311.sorted.bam" \
    TMP_DIR=./tmp \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true

# Sample SRR9599313
docker run -v $(pwd):/opt/ kfdrc/bwa-picard:latest-dev bwa mem -t 2 \
    -R "@RG\tID:Offspring172.SRR9599313\tLB:lib-SRR9599313\tPL:illumina\tSM:Offspring172\tPU:FLOWCELL1.LANE1.SRR9599313" \
    /opt/subset_genome.fa /opt/raw/SRR9599313_1.sub.fastq > SRR9599313_1.sub.bam
    
docker run -v $(pwd):/opt/ kfdrc/bwa-picard:latest-dev java -jar /picard.jar SortSam \
    I="/opt/SRR9599313_1.sub.bam" \
    O="/opt/Offspring172.SRR9599313.sorted.bam" \
    TMP_DIR=./tmp \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true

# Sample SRR9599314
docker run -v $(pwd):/opt/ kfdrc/bwa-picard:latest-dev bwa mem -t 2 \
    -R "@RG\tID:Offspring241.SRR9599314\tLB:lib-SRR9599314\tPL:illumina\tSM:Offspring241\tPU:FLOWCELL1.LANE1.SRR9599314" \
    /opt/subset_genome.fa /opt/raw/SRR9599314_1.sub.fastq > SRR9599314_1.sub.bam
    
docker run -v $(pwd):/opt/ kfdrc/bwa-picard:latest-dev java -jar /picard.jar SortSam \
    I="/opt/SRR9599314_1.sub.bam" \
    O="/opt/Offspring241.SRR9599314.sorted.bam" \
    TMP_DIR=./tmp \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true
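
Instead of repeating the block for each sample, the alignment-and-sort steps can be wrapped in a loop. A sketch, using a run-to-sample mapping file created on the spot:

# map each SRA run to its sample name (one pair per line)
printf 'SRR9599311 Offspring73\nSRR9599313 Offspring172\nSRR9599314 Offspring241\n' > samples.txt

while read -r SRR SAMPLE; do
    docker run -v $(pwd):/opt/ kfdrc/bwa-picard:latest-dev bwa mem -t 2 \
        -R "@RG\tID:${SAMPLE}.${SRR}\tLB:lib-${SRR}\tPL:illumina\tSM:${SAMPLE}\tPU:FLOWCELL1.LANE1.${SRR}" \
        /opt/subset_genome.fa /opt/raw/${SRR}_1.sub.fastq > ${SRR}_1.sub.bam
    docker run -v $(pwd):/opt kfdrc/bwa-picard:latest-dev java -jar /picard.jar SortSam \
        I="/opt/${SRR}_1.sub.bam" \
        O="/opt/${SAMPLE}.${SRR}.sorted.bam" \
        TMP_DIR=./tmp SORT_ORDER=coordinate CREATE_INDEX=true
done < samples.txt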
    

mkdir aligned
mv *sorted.bam *sorted.bai aligned

# Use this command to check the alignment profile - Don't be frustrated with our results, remember that it is a subset
cd aligned
docker run -v $(pwd):/opt kfdrc/samtools:latest samtools flagstat /opt/Offspring73.SRR9599311.sorted.bam

If you have more than one genome and want to test how much of your sample aligns to each one, FastQ Screen is a good tool for that.

SNP callers

There are many of them too! Here we will explore only three: Stacks, FreeBayes, and HaplotypeCaller (from the GATK toolkit). If you are interested in TASSEL, we suggest this other tutorial.

TASSEL and Stacks were designed specifically for RADseq sequences, and their tutorials need no adaptation for this use case, but they are not as flexible as FreeBayes and HaplotypeCaller. Because of their flexibility, the latter two provide other measures that help users better understand what is happening in their particular dataset and suggest different filters. In particular, HaplotypeCaller and the entire GATK toolkit provide lots of tools and tutorials to explore your sequences in depth, but GATK is also more complex than the others.

STACKs

Stacks and TASSEL offer pipelines that do not need a reference genome; the RADseq sequences are aligned against each other instead. Comparisons between these approaches show that, without a reference genome, fewer SNPs are called than when a reference genome is used (references 1, 2). Therefore, we suggest using a reference genome, even if it comes not from the same species but from a related one. You can also use a combination of genomes or transcriptomes, as done in this study.

Please check the program documentation to adjust the parameters. Stacks also allows you to run each of its steps separately.
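
For reference, the de novo flavor would look roughly like this (a sketch with illustrative -M/-n values; tune them on real data, and note that the popmap prefixes must then match the FASTQ file names in /opt/filtered/, here a hypothetical popmap_denovo.txt):

mkdir stacks_denovo

docker run -v $(pwd):/opt taniguti/stacks denovo_map.pl -T 2 -M 2 -n 2 \
    -o /opt/stacks_denovo/ --popmap /opt/popmap_denovo.txt --samples /opt/filtered/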

With the reference genome

mkdir stacks

docker run -v $(pwd):/opt taniguti/stacks ref_map.pl -T 2 -o /opt/stacks/ --popmap /opt/popmap.txt --samples /opt/aligned/

# Generate VCF
docker run -v $(pwd):/opt taniguti/stacks populations --in_path /opt/stacks --popmap /opt/popmap.txt --vcf

The population map file (tab-separated) in this case:

cat popmap.txt
Offspring73.SRR9599311.sorted    pop1
Offspring172.SRR9599313.sorted    pop1
Offspring241.SRR9599314.sorted    pop1
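
If you need to create it, printf guarantees real tab characters (safer than relying on an editor):

printf 'Offspring73.SRR9599311.sorted\tpop1\n'  >  popmap.txt
printf 'Offspring172.SRR9599313.sorted\tpop1\n' >> popmap.txt
printf 'Offspring241.SRR9599314.sorted\tpop1\n' >> popmap.txt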

FreeBayes

Check the program documentation here.

mkdir freebayes

docker run -it -v $(pwd):/opt taniguti/freebayes freebayes -f /opt/subset_genome.fa \
                                                  /opt/aligned/Offspring172.SRR9599313.sorted.bam \
                                                  /opt/aligned/Offspring241.SRR9599314.sorted.bam \
                                                  /opt/aligned/Offspring73.SRR9599311.sorted.bam > freebayes/freebayes.vcf

GATK - Joint Call

GATK is a toolkit with more than 200 available tools; you can imagine a practical guide is not the place to explain all its features. Anyway, the Broad Institute team did a really good job with the tutorials on their website. GATK also has workflows available here for the most common scenarios. The bad news is that these scenarios do not involve RADseq or GBS datasets, so the GATK workflows need adaptations to analyze them. A good example is duplicate removal: you can’t remove duplicates from GBS data, while for WGS or exome data this is an important step. If you have a lot of prior information about your species, such as a good reference genome or a marker database, it can be used to improve SNP and genotype calling in GATK with tools such as BQSR and VariantRecalibrator. If you don’t have much prior information, GATK provides several quality measures for each called marker, and you can apply what they call “hard filters” to your markers.

The so-called joint call performs the variant calling per sample and produces g.vcf files, VCFs that have records for every position in the genome. Afterwards, a database is created to join this information, and from it we can extract a traditional VCF file. The per-sample analysis is kept separate for several reasons. One is processing flexibility: we can parallelize it however we want. Another is the so-called N+1 problem: once the database is created, you don’t need to repeat the whole analysis to add an extra sample.

# GATK requires additional index files for the reference
docker run -v $(pwd):/opt/ kfdrc/bwa-picard:latest-dev java -jar /picard.jar CreateSequenceDictionary R=/opt/subset_genome.fa O=/opt/subset_genome.dict
docker run -v $(pwd):/opt/ kfdrc/samtools:latest samtools faidx /opt/subset_genome.fa

docker run -it -v $(pwd):/opt taniguti/gatk-picard /gatk/gatk HaplotypeCaller \
                                                    -ERC GVCF \
                                                    -R /opt/subset_genome.fa \
                                                    -I /opt/aligned/Offspring172.SRR9599313.sorted.bam \
                                                    -O /opt/Offspring172_rawLikelihoods.g.vcf \
                                                    --max-reads-per-alignment-start 0
                                                    
docker run -it -v $(pwd):/opt taniguti/gatk-picard /gatk/gatk HaplotypeCaller \
                                                    -ERC GVCF \
                                                    -R /opt/subset_genome.fa \
                                                    -I /opt/aligned/Offspring241.SRR9599314.sorted.bam \
                                                    -O /opt/Offspring241_rawLikelihoods.g.vcf \
                                                    --max-reads-per-alignment-start 0
                                                    
docker run -it -v $(pwd):/opt taniguti/gatk-picard /gatk/gatk HaplotypeCaller \
                                                    -ERC GVCF \
                                                    -R /opt/subset_genome.fa \
                                                    -I /opt/aligned/Offspring73.SRR9599311.sorted.bam \
                                                    -O /opt/Offspring73_rawLikelihoods.g.vcf \
                                                    --max-reads-per-alignment-start 0

grep ">" subset_genome.fa > interval_list_temp # Find scaffold name
sed 's/^.//' interval_list_temp > interval_list_temp2 # remove > at the beginning
awk  '{print $1;}' interval_list_temp2 > interval.list # gets only the first word

docker run -it -v $(pwd):/opt taniguti/gatk-picard /gatk/gatk GenomicsDBImport \
                                                  --genomicsdb-workspace-path /opt/my_database \
                                                  -L /opt/interval.list \
                                                  -V /opt/Offspring73_rawLikelihoods.g.vcf \
                                                  -V /opt/Offspring241_rawLikelihoods.g.vcf \
                                                  -V /opt/Offspring172_rawLikelihoods.g.vcf

tar -cf my_database.tar my_database # optional: archive the database if you want to store or move it

docker run -it -v $(pwd):/opt taniguti/gatk-picard /gatk/gatk GenotypeGVCFs \
                                                   -R /opt/subset_genome.fa \
                                                   -O /opt/gatk.vcf.gz \
                                                   -G StandardAnnotation \
                                                   -V gendb:///opt/my_database
                                                   
mkdir gatk
mv gatk.vcf* gatk
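
As an example of the “hard filters” mentioned earlier, a sketch applying two of the generic thresholds from the GATK documentation (filtered records are flagged in the FILTER column, not removed; always inspect your own annotation distributions first):

docker run -it -v $(pwd):/opt taniguti/gatk-picard /gatk/gatk VariantFiltration \
    -R /opt/subset_genome.fa \
    -V /opt/gatk/gatk.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    -O /opt/gatk/gatk_hardfiltered.vcf.gz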

Now we have VCF files from Stacks, FreeBayes, and GATK. We cannot compare their performance, because we ran only a subset, but we can compare the information and format output by each one.
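
For a quick, informal look at how each caller structures its records (column header plus the first variants; the Stacks output file name may vary slightly by version):

grep -v "^##" stacks/populations.snps.vcf | head -n 3
grep -v "^##" freebayes/freebayes.vcf | head -n 3
zcat gatk/gatk.vcf.gz | grep -v "^##" | head -n 3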

Running bigger datasets

In the world outside this tutorial, you will face larger datasets, which will require more computational resources, so you will need to run part of these analyses on clusters or in the cloud. If you don’t have access to either, try creating an account on Google Cloud; the first months and some resources are free. You will need to explore the configuration for that. Start from here.

On HPC clusters, administrators usually install a workload manager such as SLURM. This tool keeps users from exceeding the HPC capacity established for individual usage; in other words, it makes the laws in the HPC, so all users can enjoy it as a civilized community. You will need to learn the rules before running your processes, but usually the administrators make a tutorial available, just ask for it.
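
As an illustration only (partition names, limits, and module systems vary between clusters), a SLURM submission script could look like this:

#!/bin/bash
#SBATCH --job-name=bwa-align       # name shown in the queue
#SBATCH --cpus-per-task=2          # matches bwa mem -t 2
#SBATCH --mem=4G
#SBATCH --time=01:00:00            # wall-time limit; check your cluster's rules

bwa mem -t 2 subset_genome.fa raw/SRR9599311_1.sub.fastq > SRR9599311_1.sub.sam

Submit it with sbatch and follow it in the queue with squeue.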

Also, on HPCs Docker is not always available because of security concerns. Instead, you can run Docker images through Singularity.
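
For example, Singularity can pull and run a Docker image directly, with no root privileges needed (it also mounts your home and current directory by default, so no -v flag is required):

singularity exec docker://biocontainers/fastqc:v0.11.9_cv7 fastqc SRR9599311_1.sub.fastq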

Or, to automate all these interactions with HPC and cloud, you can use a workflow language. There are several available. The Broad Institute team developed a combination of the WDL workflow language and the Cromwell workflow manager, which has already solved a lot of the technical issues involved in analyzing biological sequences in the cloud or on HPC. See the Genomics in the Cloud book and the GATK tutorials.

Considerations

Scientists are always creating new things, and of course, that is great! All fields are evolving very fast, and every time you look, you find something new. But if you are also creating new things, it is a good idea to freeze all the other tools you depend on, so you have control over what is changing. You can do that with Docker. It is still important to keep the images updated, though, so you can follow the rhythm of scientific development.