
Evidence-based advice on the number of gene-drop iterations a pedigree needs
Source: R/gvaConvergence.R
gvaConvergence.Rd
Part of Genetic Value Analysis
Usage
gvaConvergence(
  ped,
  pop = NULL,
  nMax = 3000L,
  guThresh = 1L,
  byID = TRUE,
  grid = NULL,
  k = 20L,
  oMin = 0.9,
  rhoMin = 0.95,
  seed = NULL,
  updateProgress = NULL,
  breedingTable = NULL,
  gestationTable = NULL,
  breedingAgeDefault = NULL,
  gestationDefault = NULL
)
Arguments
- ped
The pedigree information in data.frame format (the same input reportGV takes).
- pop
Character vector with animal IDs to consider as the population of interest. The default is NULL (all animals).
- nMax
Integer gene-drop budget: the number of iteration columns to simulate. Reproducibility is assessed for iteration counts N with 2 * N <= nMax (each half-split needs 2 * N columns). Default 3000.
- guThresh
Integer threshold number of animals for defining a rare (unique) allele, passed to calcGU / calcA. Default 1.
- byID
Logical passed to alleleFreq() via calcA; if TRUE, homozygous alleles are counted once per individual. Default TRUE.
- grid
Integer vector of candidate iteration counts to assess. The default builds c(25, 50, 100, 200, 400, 800, 1500), keeping only those with 2 * N <= nMax.
- k
Integer size of the top-k selected set compared for overlap. Default 20.
- oMin
Numeric minimum top-k overlap for reproducibility. Default 0.90.
- rhoMin
Numeric minimum Kendall rank agreement for reproducibility. Default 0.95.
- seed
Optional integer; when supplied, set_seed pins the gene-drop RNG so the convergence curve is reproducible. Default NULL.
- updateProgress
Function or NULL passed through to geneDrop() to update a shiny::Progress object. Default NULL.
- breedingTable, gestationTable, breedingAgeDefault, gestationDefault
Optional overrides for the unknown-parent mean-kinship correction, passed through to correctUnknownParentMeanKinship() exactly as reportGV passes them (issue #73 Part 2). NULL uses the bundled defaults.
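The interaction between nMax and the default grid can be sketched in a few lines of base R. This is only an illustration of the documented rule (keep candidate counts N with 2 * N <= nMax); defaultGrid is a hypothetical helper name, not a function exported by the package, and the internal implementation may differ.

```r
# Hypothetical helper mirroring the documented default-grid rule.
# It is NOT part of nprcgenekeepr; the package's internals may differ.
defaultGrid <- function(nMax) {
  candidates <- c(25L, 50L, 100L, 200L, 400L, 800L, 1500L)
  # Each half-split needs 2 * N columns, so keep only N with 2 * N <= nMax.
  candidates[2L * candidates <= nMax]
}

defaultGrid(3000L) # all seven candidates, since 2 * 1500 = 3000
defaultGrid(200L)  # 25, 50, 100
```

Under this rule, a small budget such as nMax = 200 leaves only three candidate counts to assess.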
Value
An object of class nprcgenekeeprGVConv: a list with
- convergence – a data.frame with one row per assessed iteration count: iterations, topOverlap (top-k selected-set overlap, from 0 to 1), and rankAgreement (Kendall rank agreement of the commonly-ranked animals, from -1 to 1).
- recommendedIter – the smallest assessed iteration count meeting both criteria, or NA if none did within grid.
- converged – TRUE if any assessed count met both criteria.
- criteria – the k, oMin, and rhoMin used.
- nRankable – the number of probands carrying a (non-NA) rank that the order metrics are computed on.
- nUndetermined – the count of the excluded issue #76 Undetermined set (2C).
- nMax – the gene-drop budget actually simulated.
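The relationship between the convergence table, converged, and recommendedIter described above can be sketched as follows. The table values here are invented for illustration and the code is an assumption about how the documented rule could be applied, not the package's own implementation.

```r
# Hypothetical convergence table with the documented columns; the
# numbers are made up for illustration only.
convergence <- data.frame(
  iterations    = c(25L, 50L, 100L, 200L),
  topOverlap    = c(0.80, 0.90, 0.95, 1.00),
  rankAgreement = c(0.90, 0.93, 0.96, 0.98)
)
oMin   <- 0.9
rhoMin <- 0.95

# A count is reproducible only when BOTH criteria hold.
meetsBoth <- convergence$topOverlap >= oMin &
  convergence$rankAgreement >= rhoMin

converged <- any(meetsBoth)
# Smallest assessed count meeting both criteria, or NA if none did.
recommendedIter <- if (converged) {
  min(convergence$iterations[meetsBoth])
} else {
  NA_integer_
}
```

In this made-up table, 50 iterations pass the overlap criterion but fail the rank-agreement criterion, so the first count that satisfies both is 100.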
Details
Genome uniqueness (calcGU) is the only ranked Genetic Value
Analysis output that carries Monte Carlo (gene-drop) sampling noise, so the
number of iterations a colony actually needs is pedigree-dependent: there is
no single universal "right" count. gvaConvergence answers issue #2's
literal ask – "define reproducible and automate finding the needed number of
iterations" – on the ratified definition that the decision-relevant quantity
is the selection order (which animals are chosen, and in what order),
not the precision of the gu number itself.
Because the n gene-drop iteration columns are independent and
identically distributed replicates, the whole convergence picture is
recoverable from a single completed gene drop the user already pays
for: gvaConvergence runs one gene drop at nMax, computes the
per-iteration rare-allele matrix once (calcA), and for each
candidate iteration count N in grid splits the columns into two
disjoint halves of N columns each. The two halves are genuinely
independent N-iteration estimates; each is ranked through the same
ordering pipeline the report uses, and the two orderings are compared. A run
is judged reproducible at N when both
- the top-k selected animals overlap by at least oMin (the same animals are chosen), and
- the Kendall rank agreement of the commonly-ranked animals is at least rhoMin (they come out in the same order).
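The half-split comparison above can be illustrated with a toy simulation in base R. Everything here is an assumption made for the sketch: the simulated 0/1 matrix stands in for the per-iteration rare-allele matrix, a simple row mean stands in for the report's ordering pipeline, and halfSplit is a hypothetical helper name. It is not how gvaConvergence is implemented.

```r
set.seed(1)
nAnimals <- 40L
nMax <- 200L

# Toy stand-in for a per-iteration matrix: rows are animals, columns are
# i.i.d. iterations. Each animal has its own underlying probability.
trueValue <- seq(0.1, 0.9, length.out = nAnimals)
sims <- matrix(
  rbinom(nAnimals * nMax, size = 1L, prob = rep(trueValue, times = nMax)),
  nrow = nAnimals,
  dimnames = list(paste0("A", seq_len(nAnimals)), NULL)
)

# Hypothetical helper: compare two disjoint halves of n columns each.
halfSplit <- function(sims, n, k) {
  stopifnot(2L * n <= ncol(sims), k <= nrow(sims))
  est1 <- rowMeans(sims[, seq_len(n), drop = FALSE])
  est2 <- rowMeans(sims[, n + seq_len(n), drop = FALSE])
  top1 <- names(sort(est1, decreasing = TRUE))[seq_len(k)]
  top2 <- names(sort(est2, decreasing = TRUE))[seq_len(k)]
  c(
    topOverlap    = length(intersect(top1, top2)) / k,
    rankAgreement = cor(est1, est2, method = "kendall")
  )
}

# One row per candidate iteration count, as in the convergence table.
t(sapply(c(25L, 50L, 100L), function(n) {
  c(iterations = n, halfSplit(sims, n, k = 10L))
}))
```

Because the two halves share no columns, each row compares two independent n-iteration estimates rather than an estimate against a pooled value.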
The recommended iteration count is the smallest N in grid at
which both criteria hold. The issue #76 de-inflated gu = 0
"Undetermined" set (both parents unknown, no recorded origin) is a policy
constant with rank NA; it is excluded from the order the criteria are
computed on and reported separately as nUndetermined.
Because the half-split compares two N-column runs to each other
(never to their pooled mean), it is a conservative, self-validating estimate
of reproducibility at N. This changes nothing about seeding: a fixed
seed already makes gu bit-identical run to run; that is
reproducibility of the process, whereas this function reports the
sampling reproducibility of the estimate.
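The distinction between the two kinds of reproducibility can be shown with a generic Monte Carlo stand-in. This sketch uses base R's set.seed and a uniform-mean estimate purely as an analogy; it does not call the package's set_seed or run a gene drop, and estimate and spread are hypothetical helper names.

```r
# Stand-in for one Monte Carlo estimate from n iterations (not a gene drop).
estimate <- function(seed, n) {
  set.seed(seed)
  mean(runif(n))
}

# Process reproducibility: the same seed gives a bit-identical result.
identical(estimate(seed = 1L, n = 100L), estimate(seed = 1L, n = 100L))

# Sampling reproducibility: independent n-iteration runs disagree with each
# other, and the spread between them shrinks as n grows.
spread <- function(n) sd(vapply(1:50, estimate, numeric(1), n = n))
spread(25L) > spread(1600L)
```

A fixed seed removes the first kind of variation entirely but says nothing about the second, which is the one the half-split criteria measure.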
Examples
library(nprcgenekeepr)
## A quick, small illustration (use a larger nMax in practice).
conv <- gvaConvergence(nprcgenekeepr::qcPed, nMax = 200L, seed = 1L)
conv$convergence
#>   iterations topOverlap rankAgreement
#> 1         25          1             1
#> 2         50          1             1
#> 3        100          1             1
conv$recommendedIter
#> [1] 25