Simulations with Reads2MapTools

There are three different ways to generate simulated data to build maps in Reads2MapTools:

This function just convert the PedigreeSim software files to OneMap raw data. The only source of errors are missing data controlled by miss.perc argument.

Simulating the VCF you can include other sources of errors. The map is simulated in PedigreeSim and a statistical model can be selected to simulate the reference and alternative alleles counts for each genotype. The genotypes can be reestimated according to the alleles counts simulated using the genotype callers updog (function updog_genotype), polyRAD (function polyRAD_genotype) and SuperMASSA (function supermassa_genotype).

The workflow SimulatedReads is available in Reads2Map repository in github. It performs the simulation of RADseq Illumina reads for outcrossing population, the SNP calling is made in Freebayes and GATK software, the genotype calling in updog, supermassa and polyRAD, and build genetics maps using every combination in OneMap and GUSMap. The workflow is wroten in Workflow Description Language (WDL) and cromwell workflow management system. The parameters of every software implented can be easily changed. All the results can be evaluated in a shiny app.

Run PedigreeSim

It will require java installed. You can run PedigreeSim directly and just use the output files in OneMap or you can use a function named run_pedsim that facilitates this procedure.

The function do not provide every possibilities offered by PedigreeSim software. If you want to change any parameter that is not available in the function, please use directly the PedigreeSim software. Here is the software documentation for more information.

# Required packages for this tutorial
library(onemap)
library(Reads2MapTools)
library(ggplot2)
library(vcfR)
# For outcrossing population
run_pedsim(chromosome = "Chr1", n.marker = 54, tot.size.cm = 100, centromere = 50,
           n.ind = 200, mk.types = c("A1", "A2", "A3", "A4", "B1.5", "B2.6", "B3.7", 
                                     "C.8", "D1.9", "D1.10", "D1.11", "D1.12", "D1.13", 
                                     "D2.14", "D2.15", "D2.16", "D2.17", "D2.18"),
           n.types = rep(3,18), pop = "F1", 
           name.mapfile = "mapfile.map", name.founderfile="founderfile.gen",
           name.chromfile="sim.chrom", name.parfile="sim.par",
           name.out="sim_out")

# For F2 population
run_pedsim(chromosome = c("Chr1", "Chr2"), n.marker = c(75, 75), 
           tot.size.cm = c(100,100), centromere = c(50,50),
           n.ind = 200, mk.types = c("A.H.B","C.A", "D.B"),
           n.types = rep(50,3), pop = "F2", 
           name.mapfile = "mapfile_f2.map", name.founderfile="founderfile_f2.gen",
           name.chromfile="sim_f2.chrom", name.parfile="sim_f2.par",
           name.out="sim_f2")

The function allows to create f2 intercross and backcross populations from bi-parental cross of inbred lines and segregating F1 population from bi-parental cross of heterozygous parents. You can define it in pop argument. You can also define the number of chromosomes (argument chromosome), the number of markers in each chromosome (n.marker), the total size of the groups in cM (tot.size.cm), the position of centromere (centromere), number of individuals in the population (n.ind), the marker types (mk.types, see the table in session Creating the data file of Outcrossing populations vignette) and the number of markers of each type (n.types).

We suggest you to open the output files founderfile, chromfile, mapfile and parfile to check if they agree with your intentions before proceed to other analysis.

Simulate OneMap raw data

Once you run PedigreeSim, you should have the output file genotypes.dat. To convert this to OneMap raw file, you just need to specify the cross type (cross), which ones are the parents (parent1 and parent2) and if you want to include missing genotypes (miss.perc). Only cross types outcross and f2 intercross are supported by now.

# For outcrossing population
pedsim2raw(genofile = "sim_out_genotypes.dat", cross="outcross", parent1 = "P1", parent2 = "P2", out.file = "sim_out.example.raw", miss.perc = 0)

onemap.obj <- read_onemap("sim_out.example.raw")
##  Working...
## 
##  --Read the following data:
##  Type of cross:           outcross 
##  Number of individuals:   200 
##  Number of markers:       54 
##  Chromosome information:  no 
##  Position information:    no 
##  Number of traits:        0
plot(onemap.obj, all = F)

# For F2 population
pedsim2raw(genofile = "sim_f2_genotypes.dat", cross="f2 intercross", parent1 = "P1", parent2 = "P2", out.file = "sim_f2.example.raw", miss.perc = 0)

onemap.obj <- read_onemap("sim_f2.example.raw")
##  Working...
## 
##  --Read the following data:
##  Type of cross:           f2 
##  Number of individuals:   200 
##  Number of markers:       150 
##  Chromosome information:  no 
##  Position information:    no 
##  Number of traits:        0
plot(onemap.obj)

Simulate VCF file

The same output file from PedigreeSim, the genotypes.dat can be used to simulate a VCF file together with the PedigreeSim mapfile and chrom. The advantages to simulate a VCF instead of onemap raw file is that VCF is a standard file format and can store a lot of other information included the allele counts, usually in the field AD or DPR. The pedsim2vcf function can simulate the allele counts using negative binomial or updog distributions (argument method). The main parameters for the distributions are defined with arguments mean.depth, that defines the mean allele depth in the progeny, p.mean.depth that defines the mean allele depth in the parents, argument disper.par defines the dispersion parameter, mean.phred defines the mean phred score of the sequencing technology used. The function can also simulate missing data (miss.perc). Through argument pos and chr you can define vectors with physical position and chromosome of each marker. With argument haplo.ref you define which one of the haplotypes in genotypes.dat will the considered the reference. Establishing phase as TRUE, the VCF will have phased genotypes. After allele counts are simulated, the genotypes are reestimated using a binomial distribution. The VCF generated by this function only have one or two FORMAT fields, the GT and AD (if counts = TRUE). Dominant markers are not supported by this function.

# Dominant markers are not supported then, simulate other dataset with only codominant markers
run_pedsim(chromosome = "Chr1", n.marker = 42, tot.size.cm = 100, centromere = 50,
           n.ind = 200, mk.types = c("B3.7", "D1.10", "D2.15"),
           n.types = rep(14,3), pop = "F1", 
           name.mapfile = "mapfile.map", name.founderfile="founderfile.gen",
           name.chromfile="sim.chrom", name.parfile="sim.par",
           name.out="sim_out")

# For outcrossing population
pedsim2vcf(inputfile = "sim_out_genotypes.dat", 
           map.file = "mapfile.map", 
           chrom.file = "sim.chrom", 
           out.file = "simu_out.vcf",
           miss.perc = 0, 
           counts = TRUE, 
           mean.depth = 100, 
           p.mean.depth = 100, 
           chr.mb = 10, 
           method = "updog", 
           mean.phred = 20,
           bias=1, 
           od=0.00001,
           pos=NULL,
           chr=NULL,
           phase = FALSE,
           disper.par=2)
## Counts simulation changed 0.2828854 % of the given genotypes
library(vcfR)
vcfR_out.obj <- read.vcfR("simu_out.vcf")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 4
##   header_line: 5
##   variant count: 42
##   column count: 211
## 
Meta line 4 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 42
##   Character matrix gt cols: 211
##   skip: 0
##   nrows: 42
##   row_num: 0
## 
Processed variant: 42
## All variants processed
onemap_out.obj <- onemap_read_vcfR(vcfR.object = vcfR_out.obj, 
                                   cross = "outcross", parent1 = "P1", parent2 = "P2")

plot(onemap_out.obj, all=F)

# For F2 population
# Dominant markers are not supported by this function
run_pedsim(chromosome = c("Chr1", "Chr2"), n.marker = c(75, 75), 
           tot.size.cm = c(100,100), centromere = c(50,50),
           n.ind = 200, mk.types = c("A.H.B"),
           n.types = 150, pop = "F2", 
           name.mapfile = "mapfile_f2.map", name.founderfile="founderfile_f2.gen",
           name.chromfile="sim_f2.chrom", name.parfile="sim_f2.par",
           name.out="sim_f2")

pedsim2vcf(inputfile = "sim_f2_genotypes.dat", 
           map.file = "mapfile_f2.map", 
           chrom.file = "sim_f2.chrom", 
           out.file = "simu_f2.vcf",
           miss.perc = 0, 
           counts = TRUE, 
           mean.depth = 100, 
           p.mean.depth = 100, 
           chr.mb = 10, 
           method = "updog", 
           mean.phred = 20,
           bias=1, 
           od=0.001,
           pos=NULL,
           chr=NULL,
           phase = FALSE,
           disper.par=2)
## Counts simulation changed 0.1412151 % of the given genotypes
vcfR_f2.obj <- read.vcfR("simu_f2.vcf")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 4
##   header_line: 5
##   variant count: 150
##   column count: 212
## 
Meta line 4 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 150
##   Character matrix gt cols: 212
##   skip: 0
##   nrows: 150
##   row_num: 0
## 
Processed variant: 150
## All variants processed
onemap_f2.obj <- onemap_read_vcfR(vcfR.object = vcfR_f2.obj, cross = "f2 intercross", parent1 = "P1", parent2 = "P2", f1 = "F1") 

plot(onemap_f2.obj, all=F)

Graphical view of genotypes and allele counts

Function create_depth_profile generates dispersion graphics with x and y axis pointing respectively the reference and alternative allele counts. The function is only available for biallelic markers and for outcrossing and f2 intercross population. Each dot represent a genotype considering mks markers and inds individuals. If are established NULL for both arguments, all markers and individuals are considered. The color of the dots are according with the genotypes present in onemap object (GTfrom = onemap) or in VCF file (GTfrom = vcf) or the color can represent the error rate (1 - highest genotype probability) of each genotype in onemap object (GTfrom = prob). A rds file is generated with the data in the graphic (rds.file). The alpha argument control the transparency of color of each dot, regulate this parameter is a good idea when having big amount of markers and individuals.

# For outcrossing population
create_depths_profile(onemap.obj = onemap_out.obj,
                      vcfR.object = vcfR_out.obj, 
                      parent1 = "P1", 
                      parent2 = "P2", 
                      vcf.par = "AD", 
                      recovering = FALSE, 
                      mks = NULL, 
                      inds = NULL, 
                      GTfrom = "vcf", 
                      alpha = 0.1,
                      rds.file = "depths_out.rds")
## Summary of reference counts: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   28.00   52.00   68.74   90.00  512.00 
## Summary of alternative counts: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.00   35.00   43.46   63.00  390.00

# For f2 intercross population
create_depths_profile(onemap.obj = onemap_f2.obj,
                      vcfR.object = vcfR_f2.obj, 
                      parent1 = "P1", 
                      parent2 = "P2",
                      vcf.par = "AD", 
                      recovering = FALSE, 
                      mks = NULL, 
                      inds = NULL, 
                      GTfrom = "vcf", 
                      alpha = 0.1,
                      rds.file = "depths_f2.rds")
## Summary of reference counts: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    28.0    51.0    66.2    87.0   714.0 
## Summary of alternative counts: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   10.00   34.00   43.43   63.00  399.00

Reestimate genotypes

If you choose to simulate allele counts in pedsim2vcf it is also possible to reestimate the genotypes using this simulated allele counts with other distributions. This will include in your data errors coming from random sampling if considering only poisson or negative binomial distributions and/or dispersion, outliers and bias errors if using updog model. Other advantage is that reestimating genotypes with any of this software you can obtain genotypes probabilities to be used in the map building instead of only genotypes.

warning: The reestimation of genotypes is only performed in biallelic codominant markers. If you have multiallelic markers in your VCF, they will be kept with the same genotype.

Updog, polyRAD and supermassa are software designed for genotyping polyploid species. This is a more complex procedure compared with diploid species. These software consider not only the proportion of alleles to define the genotypes, but other aspects as the expected distribution in the progeny according with parents genotypes. See more about them in respective manuals.

Genotypes probabilities usage

OneMap 3.0 have three options to define error probability in HMM emission phase. The default method is consider a global error rate for every genotype \(10^{-5}\). Until this present version, only this procedure was implemented. Now, with the function create_probs, we offer the option to change the global error rate (global_error argument), or consider a error rate by genotype (genotypes_errors) or a genotype probability for each genotype (genotypes_probs).

With updog

The function updog_genotype make interface of OneMap with Updog to perform the genotype calling.

onemap_geno.updog <- updog_genotype(vcf = "simu_out.vcf",  
                                    out_vcf = "updog.vcf",
                                    vcf.par = "AD",
                                    parent1="P1",
                                    parent2="P2", 
                                    crosstype = "outcross",
                                    f1=NULL,
                                    recovering = FALSE, 
                                    cores = 4,
                                    depths = NULL,
                                    global_error = NULL,
                                    use_genotypes_errors = TRUE,
                                    use_genotypes_probs = FALSE)
##     |                                   *.#,%    
##    |||                                 *******/  
##  |||||||    (**..#**.                  */   **/  
## |||||||||    */****************************/*%   
##    |||    &****..,*.************************/    
##    |||     (....,,,*,...****%********/(******    
##    |||                ,,****%////,,,,./.****/    
##    |||                  /**//         .*///....  
##    |||                  .*/*/%#         .,/   ., 
##    |||               , **/   #%         .*    .. 
##    |||                               ,,,*        
## 
## Working on it...done!New onemap object contains 42 biallelic markers
create_depths_profile(onemap.obj = onemap_geno.updog, 
                      vcfR.object = vcfR_out.obj, 
                      parent1 = "P1",
                      parent2 = "P2",
                      vcf.par = "AD", 
                      recovering = FALSE, 
                      GTfrom = "vcf") 
## Summary of reference counts: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   28.00   52.00   68.74   90.00  512.00 
## Summary of alternative counts: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.00   35.00   43.46   63.00  390.00

With polyRAD

The function polyRAD_genotype make interface of OneMap with polyRAD to perform the genotype calling.

polyRAD_genotype_vcf(vcf="simu_out.vcf", 
                     outfile = "polyrad.vcf",
                     parent1="P1",
                     parent2="P2")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 4
##   header_line: 5
##   variant count: 42
##   column count: 211
## 
Meta line 4 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 42
##   Character matrix gt cols: 211
##   skip: 0
##   nrows: 42
##   row_num: 0
## 
Processed variant: 42
## All variants processed
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 214
##   header_line: 215
##   variant count: 42
##   column count: 211
## 
Meta line 214 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 42
##   Character matrix gt cols: 211
##   skip: 0
##   nrows: 42
##   row_num: 0
## 
Processed variant: 42
## All variants processed
onemap_geno.polyRAD <- onemap_read_vcfR(vcf = "polyrad.vcf", cross = "outcross", parent1 = "P1", parent2 = "P2")

create_depths_profile(onemap.obj = onemap_geno.polyRAD, 
                      vcfR.object = vcfR_out.obj, 
                      parent1 = "P1",
                      parent2 = "P2",
                      vcf.par = "AD", 
                      recovering = FALSE, 
                      GTfrom = "vcf") 
## Summary of reference counts: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   37.00   52.00   71.68   93.50  214.00 
## Summary of alternative counts: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   25.00   49.00   57.46   80.00  225.00

With supermassa

The function supermassa_genotype make interface of OneMap with supermassa to perform the genotype calling. Warning: It requires ‘python 2’ installed.

onemap_geno.supermassa <- supermassa_genotype(vcf = "simu_out.vcf", 
                                              vcf.par = "AD",
                                              out_vcf = "supermassa.vcf",
                                              cross = "outcross",
                                              parent1="P1",
                                              parent2="P2",
                                              f1=NULL,
                                              recovering = FALSE,
                                              cores = 4,
                                              depths = NULL,
                                              global_error = NULL,
                                              use_genotypes_errors = FALSE,
                                              use_genotypes_probs = TRUE, 
                                              rm_multiallelic = F)

create_depths_profile(onemap.obj = onemap_geno.supermassa, 
                      vcfR.object = vcfR_out.obj, 
                      parent1 = "P1",
                      parent2 = "P2",
                      vcf.par = "AD", 
                      recovering = FALSE, 
                      GTfrom = "vcf") 

Example of usage

Here we want to compare the best map using updog counts simulations and updog, polyrad and supermassa genotyping and genotypes probabilities. As a simple example, we will simulate only one family and one chromosome.

for(depth in c(100, 50, 10)){
  run_pedsim(chromosome = "Chr1", n.marker = 42, tot.size.cm = 100, centromere = 50,
             n.ind = 200, mk.types = c("B3.7","D1.10", "D2.15"),
             n.types = rep(14,3), pop = "F1", 
             name.mapfile = "mapfile.map", name.founderfile="founderfile.gen",
             name.chromfile="sim.chrom", name.parfile="sim.par",
             name.out="sim_out")
  
  # VCF with unmodified genotypes
  pedsim2vcf(inputfile = "sim_out_genotypes.dat", 
             map.file = "mapfile.map", 
             chrom.file = "sim.chrom", 
             out.file = "simu_out.vcf",
             miss.perc = 0, 
             counts = FALSE)
  
  vcfR_out.obj <- read.vcfR("simu_out.vcf")
  onemap_out.obj <- onemap_read_vcfR(vcfR.object = vcfR_out.obj, cross = "outcross", parent1 = "P1", parent2 = "P2")
  
  twopts <- rf_2pts(onemap_out.obj)
  seq_ord <- make_seq(twopts, order(as.numeric(onemap_out.obj$POS)))
  map_real <- map(seq_ord, phase_cores = 4)
  p <- rf_graph_table(map_real, main = paste0("Real_",depth))
  ggsave(p, filename = paste0("real", depth,".jpeg"))
  
  # Simulating counts
  pedsim2vcf(inputfile = "sim_out_genotypes.dat", 
             map.file = "mapfile.map", 
             chrom.file = "sim.chrom", 
             out.file = "simu_out.vcf",
             miss.perc = 0, 
             counts = TRUE, 
             mean.depth = depth, 
             p.mean.depth = depth, 
             chr.mb = 10, 
             method = "updog", 
             mean.phred = 20,
             bias=0.8, 
             od=0.0001,
             pos=NULL,
             chr=NULL,
             phase = FALSE,
             disper.par=2)
  
  vcfR_out.obj <- read.vcfR("simu_out.vcf")
  onemap_out.obj <- onemap_read_vcfR(vcfR.object = vcfR_out.obj, cross = "outcross", parent1 = "P1", parent2 = "P2")
  
  onemap_geno.updog <- updog_genotype(vcf = "simu_out.vcf",  
                                    out_vcf = "updog.vcf",
                                    vcf.par = "AD",
                                    parent1="P1",
                                    parent2="P2", 
                                    crosstype = "outcross",
                                    f1=NULL,
                                    recovering = FALSE, 
                                    cores = 4,
                                    depths = NULL,
                                    global_error = NULL,
                                    use_genotypes_errors = TRUE,
                                    use_genotypes_probs = FALSE)

  twopts <- rf_2pts(onemap_geno.updog)
  seq_ord <- make_seq(twopts, order(as.numeric(onemap_geno.updog$POS)))
  map_updog <- map(input.seq = seq_ord, phase_cores = 4, rm_unlinked = T) # If HMM find problems between two markers, one of them will be automatically discarted and the sequence of markers without it will be returned
  while(is(map_updog, "vector")){  # if the result is a sequence of marker numbers, then HMM is run again
    # This will be repeteated until HMM can run with no problems
    seq_temp <- make_seq(twopts, map_updog)
    map_updog <- map(input.seq = seq_temp, phase_cores = 4, rm_unlinked = T)
  }
  p <- rf_graph_table(map_updog, main = paste0("Updog depth",depth))
  ggsave(p, filename = paste0("updog", depth,".jpeg"))
  
  polyRAD_genotype_vcf(vcf="simu_out.vcf", 
                     outfile = "polyrad.vcf",
                     parent1="P1",
                     parent2="P2")

  onemap_geno.polyRAD <- onemap_read_vcfR(vcf = "polyrad.vcf", cross = "outcross", parent1 = "P1", parent2 = "P2")

  twopts <- rf_2pts(onemap_geno.polyRAD)
  seq_ord <- make_seq(twopts, order(as.numeric(onemap_geno.polyRAD$POS)))
  map_polyRAD <- map(input.seq = seq_ord, phase_cores = 4, rm_unlinked = T)
  while(is(map_polyRAD, "vector")){
    seq_temp <- make_seq(twopts, map_polyRAD)
    map_polyRAD <- map(input.seq = seq_temp, phase_cores = 4, rm_unlinked = T)
  }
  p <- rf_graph_table(map_polyRAD, main = paste0("polyRAD depth", depth))
  ggsave(p, filename = paste0("polyrad", depth,".jpeg"))
  
onemap_geno.supermassa <- supermassa_genotype(vcf = "simu_out.vcf", 
                                              vcf.par = "AD",
                                              out_vcf = "supermassa.vcf",
                                              cross = "outcross",
                                              parent1="P1",
                                              parent2="P2",
                                              f1=NULL,
                                              recovering = FALSE,
                                              cores = 4,
                                              depths = NULL,
                                              global_error = NULL,
                                              use_genotypes_errors = FALSE,
                                              use_genotypes_probs = TRUE, 
                                              rm_multiallelic = F)
  
  twopts <- rf_2pts(onemap_geno.supermassa)
  seq_ord <- make_seq(twopts, order(as.numeric(onemap_geno.supermassa$POS)))
  map_supermassa <- map(input.seq = seq_ord, phase_cores = 4, rm_unlinked = T)
  while(is(map_supermassa, "vector")){
    seq_temp <- make_seq(twopts, map_supermassa)
    map_supermassa <- map(input.seq = seq_temp, phase_cores = 4, rm_unlinked = T)
  }
  p <- rf_graph_table(map_supermassa, main = paste0("supermassa depth", depth))
  ggsave(p, filename = paste0("supermassa", depth,".jpeg"))
  
  onemap_0.05 <- create_probs(onemap_geno.updog, global_error = 0.05)
  
  twopts <- rf_2pts(onemap_0.05)
  seq_ord <- make_seq(twopts, order(as.numeric(onemap_0.05$POS)))
  map_0.05 <- map(input.seq = seq_ord, phase_cores = 4, rm_unlinked = T)
  while(is(map_0.05, "vector")){
    seq_temp <- make_seq(twopts, map_0.05)
    map_0.05 <- map(input.seq = seq_temp, phase_cores = 4, rm_unlinked = T)
  }
  p <- rf_graph_table(map_0.05, main = paste0("0.05", depth))
  ggsave(p, filename = paste0("0.05", depth,".jpeg"))
  
  draw_map2(map_real, map_updog, map_polyRAD, map_supermassa, map_0.05, main = depth, 
            group.names = c("Real", "UD", "PR", "SM", "0.05"), 
            output = paste0(depth,"_maps.jpeg"))
}

Comparing estimated and simulated haplotypes

The simulations here included are very useful to test the efficiency of our algorithms, we made them available in case to be useful for other purposes. Using functions run_pedsim, pedsim2vcf with argument phased TRUE, we can convert the phased VCF using vcf2progeny_haplotypes, which keeps the original haplotype. With that, we can compare the haplotypes simulated and the ones estimated by the onemap approach.

Here we will use the file sim_out_cod_genotypes.dat to get one phased VCF. See that now we don’t want to simulate read counts or genotype errors; we just want the original haplotypes in VCF format, then we set counts = FALSE and don’t need to define the related arguments.

pedsim2vcf(inputfile = "sim_out_genotypes.dat", 
           map.file = "mapfile.map", 
           chrom.file = "sim.chrom", 
           out.file = "simu_out_phased.vcf",
           miss.perc = 0, 
           counts = FALSE, 
           chr.mb = 10,
           pos="cM",
           chr=NULL,
           phase = TRUE)

Now, we can convert the phased VCF to onemap_progeny_haplotypes object using function vcf2progeny_haplotypes. The position, the group names, and individuals id will be according to the ones contained in VCF file. This function can take some time to run if you select too many individuals or groups.

vcfR.object <- read.vcfR("simu_out_phased.vcf")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 4
##   header_line: 5
##   variant count: 42
##   column count: 211
## 
Meta line 4 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 42
##   Character matrix gt cols: 211
##   skip: 0
##   nrows: 42
##   row_num: 0
## 
Processed variant: 42
## All variants processed
progeny_dat <- vcf2progeny_haplotypes(vcfR.object = vcfR.object, ind.id = c("F1_004", "F1_005"), group_names = "Chr1", parent1 = "P1", parent2 = "P2", crosstype = "outcross")
plot(progeny_dat)

This is the simulated dataset. If OneMap algorithm could get all information needed, it could reproduce these haplotypes exactly how they were in simulated data. Let’s see how OneMap estimates these haplotypes:

onemap_test <- onemap_read_vcfR(vcfR.object = vcfR.object, cross = "outcross", parent1 = "P1", parent2 = "P2", only_biallelic = FALSE)

onemap_test # Just to remember of which dataset we are talking about
##   This is an object of class 'onemap'
##     Type of cross:      outcross 
##     No. individuals:    200 
##     No. markers:        42 
##     CHROM information:  yes 
##     POS information:    yes 
##     Percent genotyped:  100 
## 
##     Segregation types:
##                B3.7 -->  14
##               D1.10 -->  14
##               D2.15 -->  14
## 
##     No. traits:         0
twopts <- rf_2pts(onemap_test)
## Computing 861 recombination fractions ...
seq1 <- make_seq(twopts, "all")
map_test <- map(seq1)
progeny_est <- progeny_haplotypes(map_test, ind = c(4,5), most_likely = FALSE)
plot(progeny_est)

See that, besides markers are not in their exact position, the recombination breakpoints are the same as the ones simulated. The marker position could only be the same as the ones simulated if we have a very big (or maybe infinite) population size. Still, here we have only 200 individuals, but this is not a problem once we could reproduce the haplotypes.

Simulate Illumina reads

This one is a bit more complex and the its tools are stored in the github repository Reads2Map.

Remove generated files

system("rm sim* founderfile* mapfile* *rds ")
## Warning in system("rm sim* founderfile* mapfile* *rds "): 'rm' not found
## [1] 127