
Overview

Every analysis in nprcgenekeepr – genetic value, breeding-group formation, the age-sex pyramid – starts from a quality-controlled pedigree. Real studbooks arrive with inconsistent column names, mixed date formats, sex codes that disagree with how an animal is actually used as a parent, duplicated records, and impossible parentages. qcStudbook() is the single function that validates and standardizes a studbook before anything else touches it; the Genetic Value Analysis and Forming Breeding Groups articles both call it as their very first step.

qcStudbook() performs several families of checks:

  • required columns – the studbook must contain id, sire, dam, sex, and birth;
  • identifier validity – animal IDs must be alphanumeric (a period is not allowed, because it breaks formulas, file names, and namespaces across software environments);
  • sex consistency – an animal used as a sire must be male and one used as a dam must be female;
  • date validity – birth and exit dates must parse to real calendar dates;
  • duplicate detection – the same ID must not appear in two conflicting records;
  • minimum parent age – a parent must be at least minParentAge years old (default 2.0) at an offspring’s birth.
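
Two of these checks are simple enough to sketch in base R. The helper names below (missingRequired, idsWithPeriod) and the toy data are hypothetical, written only for illustration; they are not nprcgenekeepr functions, and qcStudbook() may implement the checks differently.

```r
# Illustrative only: hypothetical helpers, not nprcgenekeepr functions.
requiredCols <- c("id", "sire", "dam", "sex", "birth")

# Required columns that are absent from a studbook
missingRequired <- function(sb) setdiff(requiredCols, names(sb))

# IDs that contain a period
idsWithPeriod <- function(ids) {
  ids[!is.na(ids) & grepl(".", ids, fixed = TRUE)]
}

# Toy studbook with no birth column and one ID containing a period
toy <- data.frame(
  id   = c("A1", "B.2", "C3"),
  sire = c(NA, NA, "A1"),
  dam  = c(NA, NA, "B.2"),
  sex  = c("M", "F", "F"),
  stringsAsFactors = FALSE
)

missingRequired(toy)   # "birth"
idsWithPeriod(toy$id)  # "B.2"
```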

It runs in one of two modes, controlled by reportErrors:

  • production mode (reportErrors = FALSE, the default) returns a clean, standardized data frame – silently correcting what it can safely repair and stopping on anything it cannot;
  • diagnostic mode (reportErrors = TRUE) does not stop. It scans the whole studbook and returns a list of everything it found, so you can see every problem in one pass and fix the source data.

Setup

qcStudbook() is deterministic – it runs no random simulation – so, unlike the genetic-value and breeding-group analyses (which seed a gene-drop simulation), this article needs no set_seed() call to be reproducible.

Cleaning a real studbook

The package ships examplePedigree, a realistic 3,694-animal studbook. Running it through qcStudbook() in production mode returns a standardized pedigree ready for analysis.

ped <- qcStudbook(examplePedigree,
  minParentAge  = 2.0,
  reportChanges = FALSE,
  reportErrors  = FALSE
)
dim(ped)
#> [1] 3694   12
names(ped)
#>  [1] "id"           "sire"         "dam"          "sex"          "gen"         
#>  [6] "birth"        "exit"         "age"          "ancestry"     "origin"      
#> [11] "status"       "recordStatus"
head(ped[, c("id", "sire", "dam", "sex", "gen", "birth", "exit", "age")])
#>       id sire  dam sex gen      birth       exit age
#> 1 01WY5E <NA> <NA>   M   0 2005-07-07 2005-08-26 0.1
#> 2 02GZ4L <NA> <NA>   U   0 2004-07-05 2004-08-15 0.1
#> 3 079ZJK <NA> <NA>   F   0       <NA> 1975-01-16  NA
#> 4 08CF4C <NA> <NA>   F   0       <NA> 1966-03-19  NA
#> 5 093AB5 <NA> <NA>   M   0 2004-09-03 2005-01-29 0.4
#> 6 0CRGND <NA> <NA>   F   0 1982-04-01 1982-04-27 0.1

The result has standardized columns, founders (animals with no known parent) marked generation 0, and an age column. Because examplePedigree already carries an age column, qcStudbook() keeps it rather than recomputing ages against the current date, so this article renders the same numbers every time. When the input has no age column, ages are computed from birth to exit – or to today’s date for animals still in the colony.
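
The birth-to-exit rule can be sketched in base R. ageInYears() is a hypothetical helper, and the formula (elapsed days divided by 365.25, rounded to one decimal place) is an assumption that happens to reproduce the ages printed above; the package's own calculation may differ.

```r
# Illustrative only: ageInYears() is hypothetical, not a package function.
# Assumption: age = elapsed days / 365.25, rounded to one decimal place.
ageInYears <- function(birth, exit, today = Sys.Date()) {
  birth <- as.Date(birth)
  end <- as.Date(exit)
  end[is.na(end)] <- today   # no exit date: measure to today instead
  round(as.numeric(end - birth) / 365.25, 1)
}

ages <- ageInYears(
  birth = c("2005-07-07", "2004-09-03", NA),
  exit  = c("2005-08-26", "2005-01-29", "1975-01-16")
)
ages  # 0.1 0.4 NA
```

A missing birth date yields NA, as in rows 3 and 4 of the output above.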

What QC standardizes

Before testing for errors, qcStudbook() normalizes the studbook so downstream code sees consistent names and codes. Column names are lower-cased and stripped of spaces, periods, and underscores, and common aliases are renamed (egoid / ego_id -> id, sireid -> sire, damid -> dam, birthdate -> birth). Setting reportChanges = TRUE records exactly what was renamed. The shipped pedGood data set has five columns – four with deliberately messy headers (ego_id, sire.id, dam_id, birth_date) and an already-canonical sex that QC leaves untouched:

chg <- qcStudbook(pedGood, reportChanges = TRUE, reportErrors = TRUE)
Filter(length, chg$changedCols)
#> $periodRemoved
#> [1] "sire.id to sireid"
#> 
#> $underScoreRemoved
#> [1] "ego_id, dam_id, and birth_date to egoid, damid, and birthdate"
#> 
#> $egoidToId
#> [1] "egoid to id"
#> 
#> $sireIdToSire
#> [1] "sireid to sire"
#> 
#> $damIdToDam
#> [1] "damid to dam"
#> 
#> $birthdateToBirth
#> [1] "birthdate to birth"
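
The renaming steps recorded above can be approximated in a few lines of base R. normalizeHeaders() is a hypothetical helper that covers only the aliases named in this article; it is a sketch of the idea, not the package's code.

```r
# Illustrative only: a hypothetical re-implementation, not the package's code.
normalizeHeaders <- function(cols) {
  cols <- tolower(cols)
  cols <- gsub("[ ._]", "", cols)   # drop spaces, periods, and underscores
  aliases <- c(egoid = "id", sireid = "sire", damid = "dam",
               birthdate = "birth")
  hit <- cols %in% names(aliases)
  cols[hit] <- unname(aliases[cols[hit]])
  cols
}

normalizeHeaders(c("ego_id", "sire.id", "dam_id", "sex", "birth_date"))
# "id" "sire" "dam" "sex" "birth"
```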

qcStudbook() also standardizes coded values: sex to M / F / U (from MALE / FEMALE / 1 / 2 / …; by default the hermaphrodite codes H / HERMAPHRODITE / 4 are folded into U), status to ALIVE / DECEASED / SHIPPED / UNKNOWN, and ancestry to INDIAN / CHINESE / HYBRID / JAPANESE / UNKNOWN / OTHER. Character dates are parsed to Date, and UNKNOWN parents become either NA or auto-generated Unnnn placeholder IDs.
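
As a rough sketch of the sex-code mapping, the hypothetical standardizeSex() below handles only the codes listed in this article plus U / UNKNOWN. Upper-casing the input first and returning NA for unrecognized codes are illustrative choices, not documented package behavior.

```r
# Illustrative only: standardizeSex() is hypothetical, not the package's
# converter. Unrecognized codes become NA here by assumption.
standardizeSex <- function(sex) {
  sex <- toupper(trimws(as.character(sex)))
  out <- rep(NA_character_, length(sex))
  out[sex %in% c("M", "MALE", "1")] <- "M"
  out[sex %in% c("F", "FEMALE", "2")] <- "F"
  # Hermaphrodite codes are folded into U, alongside explicit unknowns
  out[sex %in% c("H", "HERMAPHRODITE", "4", "U", "UNKNOWN")] <- "U"
  out
}

standardizeSex(c("male", "F", 2, "H", "4", "unknown"))
# "M" "F" "F" "U" "U" "U"
```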

Diagnosing problems without stopping

In diagnostic mode (reportErrors = TRUE) the function returns a list whose named elements each hold one category of finding. A studbook with no errors returns NULL when reportChanges is left at its default of FALSE – nothing to report:

is.null(qcStudbook(pedGood, reportErrors = TRUE))
#> [1] TRUE

The package ships small, purpose-built data sets that each trigger one kind of problem; they make good worked examples.

Sex inconsistency. In pedFemaleSireMaleDam, s1 is recorded female but is the sire of two offspring, and d1 is recorded male but is used as a dam:

qcStudbook(pedFemaleSireMaleDam,
  reportErrors = TRUE)[c("femaleSires", "maleDams")]
#> $femaleSires
#> [1] "s1"
#> 
#> $maleDams
#> [1] "d1"
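
The logic behind these two findings can be sketched on a toy data frame. The data and variable names below are made up for illustration and do not reproduce pedFemaleSireMaleDam or the package's implementation.

```r
# Illustrative only: toy data and logic, not the package's implementation.
toy <- data.frame(
  id   = c("s1", "d1", "o1", "o2"),
  sire = c(NA, NA, "s1", "s1"),
  dam  = c(NA, NA, "d1", "d1"),
  sex  = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Animals recorded female that appear in the sire column, and vice versa
femaleSires <- toy$id[toy$sex == "F" & toy$id %in% toy$sire]
maleDams    <- toy$id[toy$sex == "M" & toy$id %in% toy$dam]

femaleSires  # "s1"
maleDams     # "d1"
```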

Invalid dates. pedInvalidDates contains "205-06-19" (a three-digit year) and "2002-16-22" (month 16); the offending row numbers are reported:

qcStudbook(pedInvalidDates, reportErrors = TRUE)$invalidDateRows
#> [1] "3" "4"
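
A base-R sketch of this kind of check follows. findInvalidDateRows() is a hypothetical helper that assumes character dates in ISO YYYY-MM-DD form only, whereas qcStudbook() accepts more formats. The explicit pattern test is there because as.Date() alone may accept a short year such as 205 on some platforms.

```r
# Illustrative only: hypothetical helper; assumes ISO "YYYY-MM-DD" strings.
findInvalidDateRows <- function(x) {
  x <- as.character(x)
  shaped <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)  # four-digit year, etc.
  parsed <- as.Date(x, format = "%Y-%m-%d")           # NA if not a real date
  # Missing dates are not errors; malformed or unparseable ones are
  which(!is.na(x) & (!shaped | is.na(parsed)))
}

badRows <- findInvalidDateRows(
  c("2001-01-15", NA, "205-06-19", "2002-16-22")
)
badRows  # 3 4
```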

Duplicate records. pedDuplicateIds repeats animal d1 in two rows:

qcStudbook(pedDuplicateIds, reportErrors = TRUE)$duplicateIds
#> [1] "d1"
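
For intuition, repeated IDs can be found with duplicated(), and rows that are identical in every column can be dropped with unique(). The toy data below is invented for illustration and is not how the package necessarily does it.

```r
# Illustrative only: toy data, not the package's implementation.
toy <- data.frame(
  id  = c("d1", "d2", "d1", "s1"),
  sex = c("F", "F", "F", "M"),
  stringsAsFactors = FALSE
)

duplicateIds <- unique(toy$id[duplicated(toy$id)])
duplicateIds                    # "d1"

deduped <- unique(toy)          # drops rows identical in every column
nrow(deduped)                   # 3
anyDuplicated(deduped$id) == 0  # TRUE: no conflicting records remain
```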

A missing required column. pedMissingBirth has no birth column at all:

qcStudbook(pedMissingBirth, reportErrors = TRUE)$missingColumns
#> [1] "birth"

An impossible parentage. In pedSameMaleIsSireAndDam, s1 is listed as both a sire and a dam:

qcStudbook(pedSameMaleIsSireAndDam, reportErrors = TRUE)$sireAndDam
#> [1] "s1"
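
Conceptually this is a set intersection of the sire and dam columns. A minimal sketch on invented vectors:

```r
# Illustrative only: toy vectors, not the package's implementation.
sire <- c(NA, NA, "s1", "s1", "s2")
dam  <- c(NA, NA, "d1", "s1", "d1")

# IDs that appear in both parent columns (missing parents ignored)
sireAndDam <- intersect(sire[!is.na(sire)], dam[!is.na(dam)])
sireAndDam  # "s1"
```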

Diagnostic mode also reports parents younger than minParentAge at an offspring’s birth (in suspiciousParents) and any ID containing a period (in invalidIdChars).
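
The minimum-parent-age rule can be sketched as follows. youngParents() is a hypothetical helper, and measuring age as days divided by 365.25 is an assumption; pairs with a missing birth date are skipped rather than flagged.

```r
# Illustrative only: hypothetical helper and toy data.
youngParents <- function(sb, minParentAge = 2.0) {
  offspringBirth <- as.Date(sb$birth)
  birthOf <- setNames(offspringBirth, sb$id)   # look up a birth date by ID
  tooYoung <- function(parent) {
    ageAtBirth <- as.numeric(offspringBirth - birthOf[parent]) / 365.25
    !is.na(ageAtBirth) & ageAtBirth < minParentAge  # NA dates are skipped
  }
  unique(c(sb$sire[tooYoung(sb$sire)], sb$dam[tooYoung(sb$dam)]))
}

toy <- data.frame(
  id    = c("s1", "d1", "o1"),
  sire  = c(NA, NA, "s1"),
  dam   = c(NA, NA, "d1"),
  birth = c("2000-01-01", "2009-06-01", "2010-06-01"),
  stringsAsFactors = FALSE
)

youngParents(toy)  # "d1": one year old when o1 was born
```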

Production mode vs diagnostic mode

The two modes treat the same problems differently. Production mode fixes what it can prove is safe and refuses to return a pedigree it cannot trust; diagnostic mode never stops, so a single pass surfaces every issue at once.

Problem                   | Production mode (reportErrors = FALSE) | Diagnostic mode (reportErrors = TRUE)
Messy column names        | corrected                              | changedCols (with reportChanges = TRUE)
Female sire / male dam    | corrected (sex flipped)                | femaleSires / maleDams
Exact duplicate record    | removed                                | duplicateIds
Missing required column   | stops                                  | missingColumns
Invalid date              | stops                                  | invalidDateRows
Sire is also a dam        | stops                                  | sireAndDam
Parent below minParentAge | stops                                  | suspiciousParents
Period in an ID           | stops                                  | invalidIdChars

Production mode silently corrects the safe cases. We can confirm it on pedFemaleSireMaleDam – where s1 becomes male and d1 female – and on pedDuplicateIds, where the repeated row is dropped:

qcStudbook(pedFemaleSireMaleDam)[, c("id", "sex")]
#>   id sex
#> 1 d1   F
#> 2 d2   F
#> 3 s1   M
#> 4 s2   M
#> 5 o1   F
#> 6 o2   F
#> 7 o3   F
#> 8 o4   M
nrow(pedDuplicateIds)             # before
#> [1] 9
nrow(qcStudbook(pedDuplicateIds)) # the duplicate row is gone
#> [1] 8

Run on data it cannot safely repair, production mode stops with a message naming the offending rows – here, the invalid dates:

tryCatch(
  qcStudbook(pedInvalidDates),
  error = function(e) cat(conditionMessage(e))
)
#> Column 'birth' has invalid dates on row(s) 3 and 4.

The recommended workflow follows from this: run once in diagnostic mode to see everything, correct the source studbook, then run in production mode to get the clean pedigree the analyses consume.
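
One way to script that workflow is sketched below. qcWorkflow() is a hypothetical wrapper, not a package function, and it uses only the reportErrors argument described in this article. It treats any diagnostic finding as blocking, including the ones production mode would repair on its own, which is a deliberately strict design choice.

```r
# Illustrative only: qcWorkflow() is a hypothetical wrapper.
# `qc` defaults to nprcgenekeepr::qcStudbook but can be any function with
# the same sb / reportErrors interface (handy for testing).
qcWorkflow <- function(sb, qc = nprcgenekeepr::qcStudbook) {
  findings <- qc(sb, reportErrors = TRUE)   # diagnostic pass: never stops
  if (!is.null(findings)) {
    # Something to fix in the source data; hand back the findings
    return(list(ok = FALSE, findings = findings, ped = NULL))
  }
  # Nothing reported: the production pass returns the clean pedigree
  list(ok = TRUE, findings = NULL, ped = qc(sb, reportErrors = FALSE))
}

# Usage, assuming the package is installed:
# res <- qcWorkflow(nprcgenekeepr::pedInvalidDates)
# if (!res$ok) str(res$findings)
```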

Key arguments

Argument      | Default | Meaning
sb            | (none)  | the studbook data frame to check
minParentAge  | 2.0     | minimum age (years) a parent must be at an offspring’s birth; skipped when birth dates are missing
reportChanges | FALSE   | record column-name corrections in changedCols
reportErrors  | FALSE   | diagnostic mode: scan everything and return a list of findings (or NULL) instead of a cleaned pedigree

See also

  • The Building a Focal-Animal Pedigree Offline article – build a focal pedigree from files with no database; quality-control it here as the next step.
  • The Genetic Value Analysis article – rank a quality-controlled pedigree by mean kinship and genome uniqueness with reportGV().
  • The Forming Breeding Groups article – assemble genetically diverse breeding groups from a quality-controlled pedigree with groupAddAssign().
  • The Age-Sex Pyramid Plots article – picture the colony’s age and sex structure with getPyramidPlot().
  • qcStudbook() – the function documented here.
  • runModularApp() – the Shiny app, whose Quality Control tab drives this same function interactively.

Reference

Vinson A, Raboin MJ (2015). “A Practical Approach for Designing Breeding Groups to Maximize Genetic Diversity in a Large Colony of Captive Rhesus Macaques (Macaca mulatta).” Journal of the American Association for Laboratory Animal Science 54(6):700-707.