Studbook Quality Control

Overview

Every analysis in nprcgenekeepr – genetic value, breeding-group formation, the age-sex pyramid – starts from a quality-controlled pedigree. Real studbooks arrive with inconsistent column names, mixed date formats, sex codes that disagree with how an animal is actually used as a parent, duplicated records, and impossible parentages. qcStudbook() is the single function that validates and standardizes a studbook before anything else touches it; the Genetic Value Analysis and Forming Breeding Groups articles both call it as their very first step.

qcStudbook() performs several families of checks:

required columns – the studbook must contain id, sire, dam, sex, and birth;
identifier validity – animal IDs must be alphanumeric (a period is not allowed, because it breaks formulas, file names, and namespaces across software environments);
sex consistency – an animal used as a sire must be male and one used as a dam must be female;
date validity – birth and exit dates must parse to real calendar dates;
duplicate detection – the same ID must not appear in two conflicting records;
minimum parent age – a parent must be at least its minimum breeding age (minSireAge for sires, minDamAge for dams) at an offspring’s birth; both default to the species- and sex-specific value.

It runs in one of two modes, controlled by reportErrors:

production mode (reportErrors = FALSE, the default) returns a clean, standardized data frame – silently correcting what it can safely repair and stopping on anything it cannot;
diagnostic mode (reportErrors = TRUE) does not stop. It scans the whole studbook and returns a list of everything it found, so you can see every problem in one pass and fix the source data.

Setup

library(nprcgenekeepr)

qcStudbook() is deterministic – it runs no random simulation – so, unlike the genetic-value and breeding-group analyses (which seed a gene-drop simulation), this article needs no set_seed() call to be reproducible.

Cleaning a real studbook

The package ships examplePedigree, a realistic 3,694-animal studbook. Running it through qcStudbook() in production mode returns a standardized pedigree ready for analysis.

ped <- qcStudbook(examplePedigree,
  minSireAge    = 2.0,
  minDamAge     = 2.0,
  reportChanges = FALSE,
  reportErrors  = FALSE
)
dim(ped)
#> [1] 3694   13
names(ped)
#>  [1] "id"           "sire"         "dam"          "sex"          "gen"         
#>  [6] "birth"        "exit"         "age"          "ancestry"     "origin"      
#> [11] "status"       "recordStatus" "fromCenter"
head(ped[, c("id", "sire", "dam", "sex", "gen", "birth", "exit", "age")])
#>       id sire  dam sex gen      birth       exit age
#> 1 01WY5E <NA> <NA>   M   0 2005-07-07 2005-08-26 0.1
#> 2 02GZ4L <NA> <NA>   U   0 2004-07-05 2004-08-15 0.1
#> 3 079ZJK <NA> <NA>   F   0       <NA> 1975-01-16  NA
#> 4 08CF4C <NA> <NA>   F   0       <NA> 1966-03-19  NA
#> 5 093AB5 <NA> <NA>   M   0 2004-09-03 2005-01-29 0.4
#> 6 0CRGND <NA> <NA>   F   0 1982-04-01 1982-04-27 0.1

The result has standardized columns, founders (animals with no known parent) marked generation 0, and an age column. Because examplePedigree already carries an age column, qcStudbook() keeps it rather than recomputing ages against the current date, so this article renders the same numbers every time. When the input has no age column, ages are computed from birth to exit – or to today’s date for animals still in the colony.

What QC standardizes

Before testing for errors, qcStudbook() normalizes the studbook so downstream code sees consistent names and codes. Column names are lower-cased and stripped of spaces, periods, and underscores, and common aliases are renamed (egoid / ego_id -> id, sireid -> sire, damid -> dam, birthdate -> birth). Setting reportChanges = TRUE records exactly what was renamed. The shipped pedGood data set has five columns – four with deliberately messy headers (ego_id, sire.id, dam_id, birth_date) and an already-canonical sex that QC leaves untouched:

chg <- qcStudbook(pedGood, reportChanges = TRUE, reportErrors = TRUE)
Filter(length, chg$changedCols)
#> $periodRemoved
#> [1] "sire.id to sireid"
#> 
#> $underScoreRemoved
#> [1] "ego_id, dam_id, and birth_date to egoid, damid, and birthdate"
#> 
#> $egoidToId
#> [1] "egoid to id"
#> 
#> $sireIdToSire
#> [1] "sireid to sire"
#> 
#> $damIdToDam
#> [1] "damid to dam"
#> 
#> $birthdateToBirth
#> [1] "birthdate to birth"

qcStudbook() also standardizes coded values: sex to M / F / U (from MALE / FEMALE / 1 / 2 / …; by default the hermaphrodite codes H / HERMAPHRODITE / 4 are folded into U), status to ALIVE / DECEASED / SHIPPED / UNKNOWN, and ancestry to INDIAN / CHINESE / HYBRID / JAPANESE / UNKNOWN / OTHER. Character dates are parsed to Date, and UNKNOWN parents become either NA or auto-generated Unnnn placeholder IDs.

Diagnosing problems without stopping

In diagnostic mode (reportErrors = TRUE) the function returns a list whose named elements each hold one category of finding. A clean studbook returns NULL – nothing to report:

is.null(qcStudbook(pedGood, reportErrors = TRUE))
#> [1] TRUE

The package ships small, purpose-built data sets that each trigger one kind of problem; they make good worked examples.

Sex inconsistency. In pedFemaleSireMaleDam, s1 is recorded female but is the sire of two offspring, and d1 is recorded male but is used as a dam:

qcStudbook(pedFemaleSireMaleDam,
  reportErrors = TRUE)[c("femaleSires", "maleDams")]
#> $femaleSires
#> [1] "s1"
#> 
#> $maleDams
#> [1] "d1"

Invalid dates. pedInvalidDates contains "205-06-19" (a three-digit year) and "2002-16-22" (month 16); the offending row numbers are reported:

qcStudbook(pedInvalidDates, reportErrors = TRUE)$invalidDateRows
#> [1] "3" "4"

Duplicate records. pedDuplicateIds repeats animal d1 in two rows:

qcStudbook(pedDuplicateIds, reportErrors = TRUE)$duplicateIds
#> [1] "d1"

A missing required column. pedMissingBirth has no birth column at all:

qcStudbook(pedMissingBirth, reportErrors = TRUE)$missingColumns
#> [1] "birth"

An impossible parentage. In pedSameMaleIsSireAndDam, s1 is listed as both a sire and a dam:

qcStudbook(pedSameMaleIsSireAndDam, reportErrors = TRUE)$sireAndDam
#> [1] "s1"

Diagnostic mode also reports parents younger than their minimum breeding age (minSireAge for sires, minDamAge for dams) at an offspring’s birth (in suspiciousParents) and any ID containing a period (in invalidIdChars).

Production mode vs diagnostic mode

The two modes treat the same problems differently. Production mode fixes what it can prove is safe and refuses to return a pedigree it cannot trust; diagnostic mode never stops, so a single pass surfaces every issue at once.

Problem	Production mode (`reportErrors = FALSE`)	Diagnostic mode (`reportErrors = TRUE`)
Messy column names	corrected	`changedCols` (with `reportChanges = TRUE`)
Female sire / male dam	corrected (sex flipped)	`femaleSires` / `maleDams`
Exact duplicate record	removed	`duplicateIds`
Missing required column	stops	`missingColumns`
Invalid date	stops	`invalidDateRows`
Sire is also a dam	stops	`sireAndDam`
Parent below `minSireAge`/`minDamAge`	stops	`suspiciousParents`
Period in an ID	stops	`invalidIdChars`

Production mode silently corrects the safe cases. We can confirm it on pedFemaleSireMaleDam – s1 becomes male and d1 female – and on pedDuplicateIds, where the repeated row is dropped:

qcStudbook(pedFemaleSireMaleDam)[, c("id", "sex")]
#>   id sex
#> 1 d1   F
#> 2 d2   F
#> 3 s1   M
#> 4 s2   M
#> 5 o1   F
#> 6 o2   F
#> 7 o3   F
#> 8 o4   M
nrow(pedDuplicateIds)             # before
#> [1] 9
nrow(qcStudbook(pedDuplicateIds)) # the duplicate row is gone
#> [1] 8

Run on data it cannot safely repair, production mode stops with a message naming the offending rows – here, the invalid dates:

tryCatch(
  qcStudbook(pedInvalidDates),
  error = function(e) cat(conditionMessage(e))
)
#> Column 'birth' has invalid dates on row(s) 3 and 4.

The recommended workflow follows from this: run once in diagnostic mode to see everything, correct the source studbook, then run in production mode to get the clean pedigree the analyses consume.

Key arguments

Argument	Default	Meaning
`sb`	–	the studbook data frame to check
`minSireAge`	`NULL`	minimum age (years) a sire must be at an offspring’s birth; `NULL` uses the species- and sex-specific breeding age; skipped when birth dates are missing
`minDamAge`	`NULL`	minimum age (years) a dam must be at an offspring’s birth; `NULL` uses the species- and sex-specific breeding age; skipped when birth dates are missing
`reportChanges`	`FALSE`	record column-name corrections in `changedCols`
`reportErrors`	`FALSE`	diagnostic mode: scan everything and return a list of findings (or `NULL`) instead of a cleaned pedigree