This vignette describes how to use Qploidy
, an R package
designed for ploidy and aneuploidy estimation using genotyping platform
data. For a detailed explanation of the Qploidy
methodology, please refer to its publication.
When does the Qploidy
methodology work?
The Qploidy
approach is effective under the following
conditions:
- Your marker data originates from Axiom or
Illumina genotyping arrays.
- Your marker data is derived from targeted sequencing
platforms (e.g., DArTag, GTseq, AgriSeq).
- All DNA samples were prepared following the same library
preparation protocol.
- You know the ploidy of at least a subset of 60
samples or you know the most common ploidy in
the dataset.
- Your dataset includes heterozygous samples.
When does the Qploidy
methodology NOT work?
The methodology will not be effective under the following
circumstances:
- Your marker data comes from RADseq or
GBS (Genotyping-by-Sequencing) platforms.
- You intend to combine datasets from different sequencing
batches.
- For example: If you extracted DNA and sequenced two plates (192
samples) as one batch, and later sequenced an additional three plates
(288 samples) as a second batch, you would need to analyze the two
batches separately in Qploidy
. Combining
all 480 samples into a single analysis will lead to incorrect
results.
- You do not have a subset of samples with known ploidy
or lack a predominant ploidy in your dataset.
- Your samples consist of inbred lines (homozygous
individuals).
Installation
To install the Qploidy
package, you can use the
following command:
Input files
In this vignette, we demonstrate the workflow using a simulated VCF file containing 50 samples and 500 genetic markers, allowing for a concise and computationally efficient example. We also present results derived from a real-world rose Axiom array dataset, which includes 524 garden rose cultivars and a total of 137,786 genetic markers. For additional context and methodological details, please refer to this this publication.
Target sequencing VCF file
Let’s generate our simulated VCF file. This file contains 500 markers and 60 samples, with a mix of tetraploid, diploid, and triploid samples. The tetraploid samples are used as the reference for standardization because they are the most common ploidy in the dataset.
library(Qploidy)
# Simulate a VCF file with 500 markers and 60 samples
simulate_vcf(
seed = 1234, file_path = "vcf_example_simulated.vcf",
n_tetraploid = 35, n_diploid = 5, n_triploid = 10,
n_markers = 500
)
## MarkerName SampleName X Y R ratio
## 1 chr1_mk1 Tetraploid1 62 153 215 0.7116279
## 2 chr1_mk1 Tetraploid2 98 104 202 0.5148515
## 3 chr1_mk1 Tetraploid3 97 91 188 0.4840426
## 4 chr1_mk1 Tetraploid4 156 56 212 0.2641509
## 5 chr1_mk1 Tetraploid5 158 44 202 0.2178218
## 6 chr1_mk1 Tetraploid6 147 48 195 0.2461538
From Axiom array summary file
In case your samples were genotyped using the Axiom array platform,
you should have a summary
file or .CEL
files.
If you have .CEL
files, you can convert them into a
summary
file using the apt-probeset-genotype
plugin available in the Analysis Power Tools (APT) platform from Thermo
Fisher Scientific (Waltham, MA). A summary
is composed by a
header that can have many lines and a body that looks like this:
## probeset_id Sample1 Sample2 Sample3 Sample4 Sample5
## 1 AX-95696335-A 66.387446 18.201038 104.790822 105.118016 104.5052361
## 2 AX-76690965-B 21.957280 7.406288 11.772518 53.006989 0.8532014
## 3 AX-83704427-A 12.368064 49.120263 110.625211 99.037832 98.9612293
## 4 AX-50778632-B 4.733086 17.542838 5.805427 4.022787 2.4019124
## 5 AX-52238885-A 97.683744 116.197899 96.799558 92.523135 5.5733040
## 6 AX-36184579-B 60.884661 16.177990 46.776783 12.399668 10.0071255
## Sample6 Sample7 Sample8 Sample9 Sample10
## 1 102.55972 118.808939 102.848765 33.786032 97.345342
## 2 70.20994 15.659821 91.132107 3.416389 3.827478
## 3 46.44331 62.658054 106.543788 105.964029 91.509631
## 4 52.23666 35.481627 21.569666 123.018371 7.954378
## 5 107.16436 7.086115 12.757439 109.580758 96.399672
## 6 10.85804 59.444760 2.278098 74.499855 87.020296
In Qploidy
, use the read_axiom
function to
read the summary
file. It skips the header and reads the
body of the file. The function will return a data frame.
MarkerName SampleName X Y R ratio
1 AX-86752740 Lemon_Fiz 977.7548 793.4656 1771.220 0.4479768
2 AX-86752740 High_Voltage 1292.2205 582.7886 1875.009 0.3108191
3 AX-86752740 Lemon_Fiz.1 919.8055 870.9619 1790.767 0.4863624
4 AX-86752740 High_Voltage.1 1368.1266 645.9481 2014.075 0.3207170
5 AX-86752740 Brite_Eyes 238.5416 1242.7593 1481.301 0.8389648
6 AX-86752740 Morden_Blush.1 302.1575 1542.4037 1844.561 0.8361900
From Illumina array
In case you have a Illumina array summary file, Qploidy
expects the following format:
[Header]
GSGT Version 2.0.5
Processing Date 4/17/2023 9:31 AM
Content specieX.bpm
Num SNPs 26251
Total SNPs 26251
Num Samples 100
Total Samples 100
[Data]
SNP Name Sample ID GC Score Theta X Y X Raw Y Raw Log R Ratio
mk-1 SAMP-2 0.6814 0.247 0.023 0.009 504 100 0.5211668
mk-10 SAMP-2 0.9204 0.000 0.006 0.000 346 27 -1.021583
mk-100 SAMP-2 0.1695 0.367 0.030 0.020 576 158 1.177404
mk-101 SAMP-2 0.8655 0.222 0.026 0.010 538 102 0.3459635
mk-103 SAMP-2 0.8932 0.000 0.019 0.000 465 18 0.7578704
mk-104 SAMP-2 0.5146 0.175 0.041 0.012 675 114 1.016254
mk-105 SAMP-2 0.8060 0.450 0.018 0.015 458 132 0.4305636
mk-106 SAMP-2 0.0000 0.386 0.002 0.002 315 57 NaN
mk-107 SAMP-2 0.4804 0.495 0.026 0.025 534 189 1.0151
Verify if you file look like this one. If confirmed, you can use the
read_illumina_array
function to read the data. If you
include more than one file, the function will merge them into a single
data frame.
data <- read_illumina_array(
"data/illumina/Set1.txt", # This is a example code, data not available
"data/illumina/Set2.txt",
"data/illumina/Set3.txt"
)
head(input_data)
MarkerName SampleName X Y R ratio
1 CT1 S1-2_1 0.023 0.009 0.032 0.2812500
2 CT10 S1-2_1 0.006 0.000 0.006 0.0000000
3 CT100 S1-2_1 0.030 0.020 0.050 0.4000000
4 CT101 S1-2_1 0.026 0.010 0.036 0.2777778
5 CT103 S1-2_1 0.019 0.000 0.019 0.0000000
6 CT104 S1-2_1 0.041 0.012 0.053 0.2264151
Define Reference Samples - Choose Path (1) or (2)
Qploidy
applies a standardization method to allele
intensities or read counts to enable the comparison of copy numbers
across the genome of a sample. To achieve this, it requires a set of
reference samples. There are two ways to select these reference
samples:
- If you have a subset of samples with known ploidy,
Qploidy
will use this subset as the reference.
- If you know the most common ploidy among your
samples, you can initially run
Qploidy
using all samples as the reference. From this first run, you can identify samples with the common ploidy, then select this subset to use as the reference in a second run ofQploidy
.
Below are the steps to proceed for each case, tagged as (1) and (2):
(1) Using a Subset with Known Ploidy
For this method, you will need to separate your subset of samples with known ploidy:
In our simulated example, the sample names tell us which ploidy they are. Let’s just create a vector with the tetraploid samples for us to use later:
## [1] "Tetraploid1" "Tetraploid2" "Tetraploid3" "Tetraploid4" "Tetraploid5"
## [6] "Tetraploid6" "Tetraploid7" "Tetraploid8" "Tetraploid9" "Tetraploid10"
## [11] "Tetraploid11" "Tetraploid12" "Tetraploid13" "Tetraploid14" "Tetraploid15"
## [16] "Tetraploid16" "Tetraploid17" "Tetraploid18" "Tetraploid19" "Tetraploid20"
## [21] "Tetraploid21" "Tetraploid22" "Tetraploid23" "Tetraploid24" "Tetraploid25"
## [26] "Tetraploid26" "Tetraploid27" "Tetraploid28" "Tetraploid29" "Tetraploid30"
## [31] "Tetraploid31" "Tetraploid32" "Tetraploid33" "Tetraploid34" "Tetraploid35"
## [36] "Diploid1" "Diploid2" "Diploid3" "Diploid4" "Diploid5"
## [41] "Triploid1" "Triploid2" "Triploid3" "Triploid4" "Triploid5"
## [46] "Triploid6" "Triploid7" "Triploid8" "Triploid9" "Triploid10"
## [1] "Tetraploid1" "Tetraploid2" "Tetraploid3" "Tetraploid4" "Tetraploid5"
## [6] "Tetraploid6" "Tetraploid7" "Tetraploid8" "Tetraploid9" "Tetraploid10"
## [11] "Tetraploid11" "Tetraploid12" "Tetraploid13" "Tetraploid14" "Tetraploid15"
## [16] "Tetraploid16" "Tetraploid17" "Tetraploid18" "Tetraploid19" "Tetraploid20"
## [21] "Tetraploid21" "Tetraploid22" "Tetraploid23" "Tetraploid24" "Tetraploid25"
## [26] "Tetraploid26" "Tetraploid27" "Tetraploid28" "Tetraploid29" "Tetraploid30"
## [31] "Tetraploid31" "Tetraploid32" "Tetraploid33" "Tetraploid34" "Tetraploid35"
(1) and (2) Run Dosage Caller
To proceed, you need to determine the dosage for all reference samples. If you haven’t done this yet, you can use any suitable dosage caller.
- For array data, we recommend using the
fitPoly
package.
- For sequencing data, we suggest using the
updog
orpolyRAD
packages.
These packages determine dosages by analyzing the distribution of allele intensities or read counts across all samples for each marker, using an informed ploidy model. You will provide either the most common ploidy in your population or the known ploidy of your selected subset as the informed ploidy.
- Path (2): If you provide the most common ploidy, samples with different ploidy levels may receive incorrect dosages. However, these samples can be filtered out in the next step.
Warning:
Depending on the number of samples and markers, this step may take a
significant amount of time to complete. It is highly recommended to run
it on a high-performance computing system where you can utilize multiple
cores and avoid issues with excessive RAM usage.
Target sequencing VCF data
- If the VCF file already contains dosage/genotype information called for the subset/most common ploidy:
## [1] 50000 4
## MarkerName Chromosome Position
## 1 chr1_mk1 chr1 249732
## 2 chr1_mk2 chr1 347883
## 3 chr1_mk3 chr1 709781
## 4 chr1_mk4 chr1 758053
## 5 chr1_mk5 chr1 794442
## 6 chr1_mk6 chr1 823933
- If the VCF file lacks marker names in the
ID
column of the fixed portion, theMarkerName
column will containNAs
. To resolve this, you can use the following code to concatenate the chromosome and position as marker names:
(1) Filter the genos object to keep only the known ploidy samples subset
- If the VCF file doesn’t contain dosage/genotype information (GT format field) called for the subset/most common ploidy:
library(tidyr)
# Prepare inputs for updog
## Approach (1)
# tobe_genotyped <- data[which(data$SampleName %in% tetraploid_samples),]
## Approach (2)
tobe_genotyped <- data
ref <- pivot_wider(tobe_genotyped[, 1:3], names_from = SampleName, values_from = X)
ref <- as.matrix(ref)
rownames_ref <- ref[, 1]
ref <- ref[, -1]
ref <- apply(ref, 2, as.numeric)
rownames(ref) <- rownames_ref
size <- pivot_wider(tobe_genotyped[, c(1, 2, 5)], names_from = SampleName, values_from = R)
size <- as.matrix(size)
rownames_size <- size[, 1]
size <- size[, -1]
size <- apply(size, 2, as.numeric)
rownames(size) <- rownames_size
size[1:10, 1:10]
## Tetraploid1 Tetraploid2 Tetraploid3 Tetraploid4 Tetraploid5
## chr1_mk1 215 202 188 212 202
## chr1_mk2 184 194 208 205 213
## chr1_mk3 182 200 193 196 214
## chr1_mk4 189 202 202 203 203
## chr1_mk5 194 209 200 199 207
## chr1_mk6 186 206 178 195 205
## chr1_mk7 216 200 206 191 193
## chr1_mk8 196 215 201 209 199
## chr1_mk9 198 216 211 197 205
## chr1_mk10 204 195 205 187 205
## Tetraploid6 Tetraploid7 Tetraploid8 Tetraploid9 Tetraploid10
## chr1_mk1 195 195 205 210 196
## chr1_mk2 200 195 209 200 195
## chr1_mk3 205 199 205 194 220
## chr1_mk4 215 204 218 203 188
## chr1_mk5 213 208 191 191 208
## chr1_mk6 203 191 200 197 200
## chr1_mk7 214 198 204 203 198
## chr1_mk8 212 200 195 197 205
## chr1_mk9 204 198 200 189 208
## chr1_mk10 212 200 201 199 200
## Tetraploid1 Tetraploid2 Tetraploid3 Tetraploid4 Tetraploid5
## chr1_mk1 62 98 97 156 158
## chr1_mk2 93 92 56 105 158
## chr1_mk3 143 110 146 52 98
## chr1_mk4 43 102 149 158 95
## chr1_mk5 101 107 200 153 151
## chr1_mk6 148 102 40 49 56
## chr1_mk7 112 0 150 141 94
## chr1_mk8 50 56 54 155 101
## chr1_mk9 98 104 58 101 108
## chr1_mk10 145 148 151 145 106
## Tetraploid6 Tetraploid7 Tetraploid8 Tetraploid9 Tetraploid10
## chr1_mk1 147 149 154 99 145
## chr1_mk2 103 50 104 0 45
## chr1_mk3 150 53 100 105 109
## chr1_mk4 55 50 105 165 144
## chr1_mk5 56 154 154 98 101
## chr1_mk6 150 94 0 147 0
## chr1_mk7 70 99 50 98 154
## chr1_mk8 151 151 140 46 114
## chr1_mk9 54 44 101 86 156
## chr1_mk10 96 200 152 150 200
library(updog)
multidog_obj <- multidog(
refmat = ref,
sizemat = size,
model = "norm", # It depends of your population structure
ploidy = 4, # Most common ploidy in your population
nc = 6
) # Change the parameters accordingly!!
## | *.#,%
## ||| *******/
## ||||||| (**..#**. */ **/
## ||||||||| */****************************/*%
## ||| &****..,*.************************/
## ||| (....,,,*,...****%********/(******
## ||| ,,****%////,,,,./.****/
## ||| /**// .*///....
## ||| .*/*/%# .,/ .,
## ||| , **/ #% .* ..
## ||| ,,,*
##
## Working on it...done!
genos <- data.frame(
MarkerName = multidog_obj$inddf$snp,
SampleName = multidog_obj$inddf$ind,
geno = multidog_obj$inddf$geno,
prob = multidog_obj$inddf$maxpostprob
)
head(genos)
## MarkerName SampleName geno prob
## 1 chr1_mk1 Tetraploid1 1 0.9999977
## 2 chr1_mk1 Tetraploid2 2 0.9999999
## 3 chr1_mk1 Tetraploid3 2 0.9999999
## 4 chr1_mk1 Tetraploid4 3 0.9999997
## 5 chr1_mk1 Tetraploid5 3 1.0000000
## 6 chr1_mk1 Tetraploid6 3 1.0000000
Notice that, if you are following approach (2) diploids and triploids samples genotypes will be wrongly called as tetraploids. But, because they are not the most common samples in the dataset, they should not interfere significantly in the dosage call of the tetraploids. You can go back to this step after a first round of ploidy evaluation and call the dosage of the tetraploids only.
Array data
Here we show an option of genotype calling for array data using fitpoly software:
library(fitPoly)
saveMarkerModels(
ploidy = 4, # Most common ploidy among Texas Garden Roses collection
data = data_reference, # output of the previous function
filePrefix = "fitpoly_output/texas_roses", # Define the path and prefix for the result files
ncores = 2
) # Change it accordingly to your system!!!!
library(vroom) # Package to speed up reading files
fitpoly_scores <- vroom("fitpoly_output/texas_roses_scores.dat")
genos <- data.frame(
MarkerName = fitpoly_scores$MarkerName,
SampleName = fitpoly_scores$SampleName,
geno = fitpoly_scores$maxgeno,
prob = fitpoly_scores$maxP
)
head(genos)
MarkerName SampleName geno prob
1 AX-86752740 1000 Wishes 2 0.7831438
2 AX-86752740 10004-N008 2 0.9892338
3 AX-86752740 10037_N046 2 0.9844171
4 AX-86752740 10038-N001 2 0.9916813
5 AX-86752740 10043_N019 2 0.9871057
6 AX-86752740 10043_N049 2 0.9955792
Differently than a VCF file, a summary
file does not
contain the genomic positions of the markers. You will need to provide
these information in a separate file as the following example.
# File containing markers genomic positions
genos.pos_file <- read.table("geno.pos_roses_texas.txt", header = T)
head(genos.pos)
SNP probes chr pos
1 Affx-86843634 AX-86752747 6 255852
2 Affx-86842821 AX-86752763 6 62032267
3 Affx-86839613 AX-86752769 5 9169310
4 Affx-86838724 AX-86752790 1 44259488
5 Affx-86840823 AX-86752809 7 12991931
6 Affx-86842443 AX-86752817 2 53861719
# Edit for Qploidy input format
genos.pos <- data.frame(
MarkerName = genos.pos_file$probes,
Chromosome = genos.pos_file$chr,
Position = genos.pos_file$pos
)
head(genos.pos)
MarkerName Chromosome Position
1 AX-86752747 6 255852
2 AX-86752763 6 62032267
3 AX-86752769 5 9169310
4 AX-86752790 1 44259488
5 AX-86752809 7 12991931
6 AX-86752817 2 53861719
(1) and (2) Standardization
## Generating standardize BAFs...
## [1] "Percentage of genotypes turned into missing data because of low genotype probability:2.28"
## [1] "Markers remove because of excess of missing data:0"
## Going to parallel mode...
## Back to single core usage
## [1] "Markers remove because of smaller number of clusters than set threshold:13"
## Going to parallel mode...
## Back to single core usage
## BAFs ready!
## Generating z scores...
## Z scores ready!
## Merging results into qploidy_standardization object...
## Writting Qploidy app input file:/var/folders/0v/yvmmxrp55jxdjsky45t_cmt40000gp/T//Rtmpi15GPE/file42c54d9e36ce.tsv.gzDone!
## This is on object of class 'ploidy_standardization'
## --------------------------------------------------------------------
## Parameters
##
## 1 standardization type: intensities
## 2 Ploidy: 4
## 3 Minimum number of heterozygous classes (clusters) present: 5
## 4 Maximum number of missing genotype by marker: 0.1
## 5 Minimum genotype probability: 0.8
## --------------------------------------------------------------------
## Filters
##
## 1 Number of markers at raw data:
## 2 Percentage of filtered genotypes by probability threshold:
## 3 Number of markers filtered by missing data:
## 4 Number of markers filtered for not having the minimum number of clusters:
## 5 Number of markers filtered for not having genomic information:
## 6 Number of markers with estimated BAF:
##
## 1 1000 (100%)
## 2 - (2.28 %)
## 3 0 (0 %)
## 4 13 (1.3 %)
## 5 0 (0 %)
## 6 987 (98.7 %)
simu_data_standardized <- standardize(
data = data,
genos = genos,
geno.pos = genos.pos,
ploidy.standardization = 4,
threshold.n.clusters = 5,
n.cores = 1,
out_filename = "my_standardized_data.tsv.gz",
verbose = TRUE
)
simu_data_standardized
How it look like for our real roses dataset:
This is on object of class 'ploidy_standardization'
--------------------------------------------------------------------
Parameters
1 standardization type: intensities
2 Ploidy: 4
3 Minimum number of heterozygous classes (clusters) present: 5
4 Maximum number of missing genotype by marker: 0.1
5 Minimum genotype probability: 0.8
--------------------------------------------------------------------
Filters
1 Number of markers at raw data: 137786 (100%)
2 Percentage of filtered genotypes by probability threshold: - (16.79 %)
3 Number of markers filtered by missing data: 5659 (4.11 %)
4 Number of markers filtered for not having the minimum number of clusters: 101853 (73.92 %)
5 Number of markers filtered for not having genomic information: 252 (0.18 %)
6 Number of markers with estimated BAF: 28100 (20.39 %)
See the details of the parameters with ?standardize
.
The print
function of the resulting object provides
information about marker filtering during the process. One critical
filtering step involves the number of dosage clusters represented for
each marker. If your samples are highly inbred at certain loci, you
might not have individuals representing all possible dosages for those
marker locations.
For example, when using tetraploids as the reference, some markers
may have individuals with dosages of only 0, 1, 3, and 4 (missing dosage
2). Without representatives for the missing dosage, Qploidy
standardization cannot be performed.
- If
threshold.n.clusters
is set toploidy + 1
, markers with missing dosages will be discarded.
- If
threshold.n.clusters
is smaller thanploidy + 1
, the missing dosage cluster centers will be imputed for standardization purposes.
Recover the ploidy_standardization
Object from a Saved
File
To revisit this step later without re-running the upstream functions,
you can load the previously generated file using the
read_qploidy_standardization
function:
## Rows: 1 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): out_filename, type
## dbl (4): threshold.missing.geno, threshold.geno.prob, ploidy.standardization...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 1 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (6): n.markers.start, geno.prob.rm, miss.rm, clusters.rm, no.geno.info.r...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 50000 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): MarkerName, SampleName, Chr
## dbl (8): X, Y, R, ratio, geno, baf, Position, z
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
(1) and (2) Plot results
Qploidy
provides several options for visualization of
the results. Check the complete description with
?plot_qploidy_standaridization
. Bellow you can see
examples:
# Select a sample to be evaluated
samples <- unique(simu_data_standardized$data$SampleName)
head(samples)
## [1] "Tetraploid1" "Tetraploid2" "Tetraploid3" "Tetraploid4" "Tetraploid5"
## [6] "Tetraploid6"
sample <- "Tetraploid1"
# Proportion of heterozygous loci, BAF (Qploidy standardized ratio), and zscore by genomic positions
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("het", "BAF", "zscore"),
dot.size = 0.05,
chr = 1:2
)
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
See how it look like for our real roses dataset:
# Heterozygous frequency, BAF (Qploidy standardized ratio), and zscore by genomic positions
# centromere positions added
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("het", "BAF", "zscore"),
dot.size = 0.05,
chr = 1:2,
add_centromeres = TRUE,
centromeres = c("chr1" = 15000000, "chr2" = 19000000)
)
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
See how it look like for our real roses dataset:
# Raw allele intensity or read count ratio and BAF (Qploidy standardized ratio) histograms
# combining all markers in the sample (sample level resolution)
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("Ratio_hist_overall", "BAF_hist_overall"),
chr = 1:2,
ploidy = 4,
add_expected_peaks = TRUE
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that, because it is a simulated data, the ratio and the BAF don’t differ much. See how it look like for our real roses dataset:
# BAF histograms (chromosome level resolution) and Z score
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("BAF_hist", "zscore"),
chr = 1:2,
add_expected_peaks = TRUE,
ploidy = c(4, 4)
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
See how it look like for our real roses dataset:
# BAF histograms combining all markers in the sample (chromosome-arm level resolution) and Z score
p <- plot_qploidy_standardization(
x = simu_data_standardized,
sample = sample,
type = c("BAF_hist", "zscore"),
chr = 1:2,
ploidy = c(4, 4, 4, 4), # Provide ploidy for each arm
add_expected_peaks = TRUE,
add_centromeres = TRUE,
centromeres = c("chr1" = 15000000, "chr2" = 19000000)
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
See how it look like for our real roses dataset:
It is important to note that the quality of results can vary across
samples, as observed in different plots. Key aspects to assess
include:
- How sharp the peaks are in the histogram plots.
- How well the dots cluster in the BAF vs. genomic position plots.
- Whether the patterns match the decay or increase of the Z-score
(Z).
In our publication, we categorize sample quality into different
resolutions based on the ability to estimate copy
numbers, ranked from highest to lowest resolution:
1. Chromosome-arm resolution: When the copy number of
all chromosome arms can be estimated.
2. Chromosome resolution: When at least one chromosome
arm’s copy number cannot be estimated.
3. Sample resolution: When the copy number of at least
one entire chromosome cannot be estimated.
4. None: When the histogram peaks, considering all
markers, do not fit any of the expected ploidy levels tested.
For more details and examples, please refer to the publication.
(1) and (2) Ploidy Estimation for All Samples
Warning: This method may not be fully accurate in certain situations. While it offers a helpful initial guide, visual inspection of the plots described in the previous section is essential to assess the resolution and confirm the ploidy.
# If sample level resolution
estimated_ploidies_sample <- area_estimate_ploidy(
qploidy_standardization = simu_data_standardized, # standardization result object
samples = "all", # Samples or "all" to estimate all samples
level = "sample", # Resolution level
ploidies = c(2, 3, 4, 5)
) # Ploidies to be tested
estimated_ploidies_sample
## Object of class qploidy_area_ploidy_estimation
## 1 Number of samples: 50
## 2 Chromosomes: chr1,chr2
## 3 Tested ploidies: 1,2,3,4,5
## 4 Number of euploid samples: 50
## 5 Number of potential aneuploid samples: 0
## 6 Number of highly inbred samples: 0
See how it look like for our real roses dataset:
Object of class qploidy_area_ploidy_estimation
1 Number of samples: 524
2 Chromosomes: 1,3,5,2,6,7,4,0
3 Tested ploidies: 1,2,3,4,5
4 Number of euploid samples: 459
5 Number of potential aneuploid samples: 0
6 Number of highly inbred samples: 65
Highly inbred samples cannot be evaluated for copy number using this
method. As a result, Qploidy
returns a missing value
(NA
) for these cases.
## [,1]
## Tetraploid1 4
## Tetraploid2 4
## Tetraploid3 4
## Tetraploid4 4
## Tetraploid5 4
## Tetraploid6 4
See how it look like for our real roses dataset:
[,1]
Lemon_Fiz 4
High_Voltage 4
Lemon_Fiz.1 4
High_Voltage.1 4
Brite_Eyes 4
Morden_Blush.1 4
# If chromosome resolution
estimated_ploidies_chromosome <- area_estimate_ploidy(
qploidy_standardization = simu_data_standardized,
samples = "all",
level = "chromosome",
ploidies = c(2, 3, 4, 5)
)
estimated_ploidies_chromosome
## Object of class qploidy_area_ploidy_estimation
## 1 Number of samples: 50
## 2 Chromosomes: chr1,chr2
## 3 Tested ploidies: 1,2,3,4,5
## 4 Number of euploid samples: 50
## 5 Number of potential aneuploid samples: 0
## 6 Number of highly inbred samples: 0
See how it look like for our real roses dataset:
Object of class qploidy_area_ploidy_estimation
1 Number of samples: 524
2 Chromosomes: 0,1,2,3,4,5,6,7
3 Tested ploidies: 1,2,3,4,5
4 Number of euploid samples: 375
5 Number of potential aneuploid samples: 92
6 Number of highly inbred samples: 57
## chr1 chr2
## Tetraploid1 4 4
## Tetraploid2 4 4
## Tetraploid3 4 4
## Tetraploid4 4 4
## Tetraploid5 4 4
## Tetraploid6 4 4
See how it look like for our real roses dataset:
0 1 2 3 4 5 6 7
Lemon_Fiz 4 4 4 4 4 4 4 4
High_Voltage 4 4 4 4 4 4 4 4
Lemon_Fiz.1 4 4 4 4 4 4 4 4
High_Voltage.1 4 4 4 4 4 4 4 4
Brite_Eyes 4 4 4 4 4 4 4 4
Morden_Blush.1 4 4 4 4 4 4 4 4
# If chromosome-arm resolution
estimated_ploidies_chromosome_arm <- area_estimate_ploidy(
qploidy_standardization = simu_data_standardized,
samples = "all",
level = "chromosome-arm",
ploidies = c(2, 3, 4, 5),
centromeres = c("chr1" = 15000000, "chr2" = 19000000)
)
estimated_ploidies_chromosome_arm
## Object of class qploidy_area_ploidy_estimation
## 1 Number of samples: 50
## 2 Chromosomes: chr1.1,chr1.2,chr2.1,chr2.2
## 3 Tested ploidies: 1,2,3,4,5
## 4 Number of euploid samples: 50
## 5 Number of potential aneuploid samples: 0
## 6 Number of highly inbred samples: 0
See how it look like for our real roses dataset:
Object of class qploidy_area_ploidy_estimation
1 Number of samples: 524
2 Chromosomes: 0,1.1,1.2,2.1,2.2,3.1,3.2,4.1,4.2,5.1,5.2,6.1,6.2,7.1,7.2
3 Tested ploidies: 1,2,3,4,5
4 Number of euploid samples: 304
5 Number of potential aneuploid samples: 171
6 Number of highly inbred samples: 49
## chr1.1 chr1.2 chr2.1 chr2.2
## Tetraploid1 4 4 4 4
## Tetraploid2 4 4 4 4
## Tetraploid3 4 4 4 4
## Tetraploid4 4 4 4 4
## Tetraploid5 4 4 4 4
## Tetraploid6 4 4 4 4
See how it look like for our real roses dataset:
0 1.1 1.2 2.1 2.2 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2
Lemon_Fiz 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
High_Voltage 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Lemon_Fiz.1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
High_Voltage.1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Brite_Eyes 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Morden_Blush.1 4 4 4 4 4 NA 4 4 4 4 4 4 4 4 4
Note that one of the chromosome arms (Morden_Blush.1 - 3.2) returned
a value of NA
. This occurs because the arm contains a high
number of inbred loci.
estimated_ploidies_format <- merge_arms_format(estimated_ploidies_chromosome_arm)
estimated_ploidies_format
## Object of class qploidy_area_ploidy_estimation
## 1 Number of samples: 50
## 2 Chromosomes: chr1,chr2
## 3 Tested ploidies: 1,2,3,4,5
## 4 Number of euploid samples: 50
## 5 Number of potential aneuploid samples: 0
## 6 Number of highly inbred samples: 0
See how it look like for our real roses dataset:
Object of class qploidy_area_ploidy_estimation
1 Number of samples: 524
2 Chromosomes: 0,1,2,3,4,5,6,7
3 Tested ploidies: 1,2,3,4,5
4 Number of euploid samples: 304
5 Number of potential aneuploid samples: 171
6 Number of highly inbred samples: 49
## chr1 chr2
## Tetraploid1 4 4
## Tetraploid2 4 4
## Tetraploid3 4 4
## Tetraploid4 4 4
## Tetraploid5 4 4
## Tetraploid6 4 4
See how it look like for our real roses dataset:
0 1 2 3 4 5 6 7
Lemon_Fiz "4" "4" "4" "4" "4" "4" "4" "4"
High_Voltage "4" "4" "4" "4" "4" "4" "4" "4"
Lemon_Fiz.1 "4" "4" "4" "4" "4" "4" "4" "4"
High_Voltage.1 "4" "4" "4" "4" "4" "4" "4" "4"
Brite_Eyes "4" "4" "4" "4" "4" "4" "4" "4"
Morden_Blush.1 "4" "4" "4" "NA/4" "4" "4" "4" "4"
Export ploidy result table
(1) and (2) Save plots for all samples
Facilitate individual visual inspection by generating figures for all samples in a loop:
dir.create("figures")
for (i in seq_along(samples)) {
print(paste0("Part:", i, "/", length(samples)))
print(paste("Generating figures for", samples[i], "..."))
# This function will generate figures for sample, chromosome and chromosome-arm resolution level evaluation
all_resolutions_plots(
data_standardized = simu_data_standardized,
sample = samples[i],
ploidy = estimated_ploidies_chromosome$ploidy[i, 1:2],
chr = 1:2,
centromeres = c("chr1" = 15000000, "chr2" = 19000000),
file_name = paste0("figures/", samples[i])
)
print(paste("Done!"))
}
By visualizing the plots, you can correct the results from
area_estimate_ploidy
and add resolution information for
each sample. In our perfect simualted data, all samples correctly
identified therefore we will just move forward with the analysis.
(2) Improve standardization selecting a known ploidy subset after first round
If you used the most common ploidy as the reference instead of a
subset with known ploidy, the standardization may not be as accurate.
You can improve the resolution by following these steps:
i. Select the highest resolution plots from the first standardization
round.
ii. Filter these samples from the data
object.
iii. Run standardization again using this subset as the reference.
iv. Reevaluate the plots and estimate the ploidy once more.
# i) Select the highest resolution plots from the first standardization round.
tetraploids <- apply(ploidy_corrected, 1, function(x) all(x == 4))
tetraploids
data_reference2 <- rownames(ploidy_corrected)[which(tetraploids)]
length(data_reference2)
# ii) Filter these samples from the `data` object.
genos_round2 <- genos[which(genos$SampleName %in% data_reference2), ]
# iii) Run standardization again using this subset as the reference.
simu_data_standardized_round2 <- standardize(
data = data,
genos = genos_round2,
geno.pos = genos.pos,
ploidy.standardization = 4,
threshold.n.clusters = 5,
n.cores = 1,
out_filename = "my_round2_standardization.tsv.gz",
verbose = TRUE
)
simu_data_standardized_round2
For our perfect simulated data, plot resolution didn’t change between first and second round but see how it look like for our real roses dataset:
This is on object of class 'ploidy_standardization'
--------------------------------------------------------------------
Parameters
1 standardization type: intensities
2 Ploidy: 4
3 Minimum number of heterozygous classes (clusters) present: 5
4 Maximum number of missing genotype by marker: 0.1
5 Minimum genotype probability: 0.8
--------------------------------------------------------------------
Filters
1 Number of markers at raw data: 137786 (100%)
2 Percentage of filtered genotypes by probability threshold: - (16.34 %)
3 Number of markers filtered by missing data: 6108 (4.43 %)
4 Number of markers filtered for not having the minimum number of clusters: 111977 (81.27 %)
5 Number of markers filtered for not having genomic information: 145 (0.11 %)
6 Number of markers with estimated BAF: 17634 (12.8 %)
New plots for roses real data after round 1 and 2 of standardization:
How to cite
Taniguti, C.H; Lau, J.; Hochhaus, T.; Lopez Arias, D. C.; Hokanson, S.C.; Zlesak, D. C.; Byrne, D. H.; Klein, P.E. and Riera-Lizarazu, O. Exploring Chromosomal Variations in Garden Roses: Insights from High-density SNP Array Data and a New Tool, Qploidy. 2025. Submitted.
Acknowledgments
This work is funded in part by the Robert E. Basye Endowment in Rose Genetics, Dept. of Horticultural Sciences, Texas A&M University, and USDA’s National Institute of Food and Agriculture (NIFA), Specialty Crop Research Initiative (SCRI) projects: ‘‘Tools for Genomics-Assisted Breeding in Polyploids: Development of a Community Resource’’ (Award No. 2020-51181-32156); and ‘‘Developing Sustainable Rose Landscapes via Rose Rosette Disease Education, Socioeconomic Assessments, and Breeding RRD-Resistant Roses with Stable Black Spot Resistance’’ (Award No. 2022-51181-38330).