Clinical overview
Every clinical decision a registrar makes — to give magnesium sulphate, to offer caesarean after a previous caesarean, to start anti-D — rests on a chain of evidence that someone analysed statistically. You cannot read a guideline, critique a journal club paper, defend a management plan in the FCOG(SA) oral, or design the research dissertation required for MMed without a working command of biomedical statistics. This is not a niche academic skill: it is the literacy that lets you tell a real treatment effect from noise, a useful test from a misleading one, and a trustworthy trial from a fragile one.
The FCOG(SA) examiners do not expect you to derive formulae. They expect interpretation under uncertainty: given a 2×2 table, a confidence interval, a forest plot, or a Kaplan–Meier curve, can you say what it means for the patient in front of you and the population behind her? This objective is weighted to higher-order thinking: you will be asked to apply, analyse and judge, not merely define. The discipline runs through the whole curriculum: the sensitivity of the HPV test in cervical screening, the number needed to treat for aspirin in pre-eclampsia prophylaxis, the relative risk reduction in the WOMAN trial of tranexamic acid for postpartum haemorrhage. Statistics is the connective tissue of evidence-based obstetrics and gynaecology.
Core knowledge
Types of data and descriptive statistics
Know your variable types, because they dictate the test. Categorical data are nominal (blood group, mode of delivery) or ordinal (Apgar score, cancer stage). Numerical data are discrete (parity, number of antenatal visits) or continuous (birthweight, blood pressure). Continuous data are summarised by a measure of central tendency and one of spread: use mean and standard deviation (SD) for symmetrical (approximately normal) distributions, and median and interquartile range (IQR) for skewed data — birthweight is roughly normal, but length of hospital stay and serum βhCG are right-skewed and demand the median. The standard error of the mean (SEM = SD/√n) is not a measure of spread of the data; it measures the precision of the estimate and shrinks as the sample grows. Confusing SD with SEM is a classic error that makes a result look more precise than it is.
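The difference between SD and SEM, and between mean and median in skewed data, is easiest to see with numbers. The following is a minimal sketch using Python's standard library; the length-of-stay values are invented for illustration and are not from any study.

```python
import math
import statistics

# Hypothetical lengths of hospital stay in days (illustrative, right-skewed).
stay_days = [2, 2, 3, 3, 3, 4, 4, 5, 7, 21]

mean = statistics.mean(stay_days)
median = statistics.median(stay_days)
sd = statistics.stdev(stay_days)        # sample SD: spread of the data
sem = sd / math.sqrt(len(stay_days))    # SEM = SD / sqrt(n): precision of the mean

# Quartile cut points; the IQR runs from q1 to q3.
q1, _, q3 = statistics.quantiles(stay_days, n=4)

print(f"mean={mean:.1f}, median={median}, SD={sd:.2f}, SEM={sem:.2f}, IQR={q1}-{q3}")
```

The single long stay drags the mean above the median, which is why the median and IQR describe skewed data better. The SEM is always smaller than the SD, so reporting it in place of the SD understates the spread.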
Distributions, the null hypothesis and p-values
Much of frequentist inference assumes an underlying normal (Gaussian) distribution, where ~68% of observations lie within 1 SD of the mean and ~95% within ~1.96 SD (standard teaching). We test a null hypothesis (H₀) — typically "no difference between groups" — against an alternative. The p-value is the probability of observing data as extreme as, or more extreme than, those obtained if the null hypothesis were true. By convention p < 0.05 is called "statistically significant," but this threshold is arbitrary and much abused. A p-value is not the probability that the null hypothesis is true, not the probability the result arose by chance, and tells you nothing about the size or clinical importance of an effect. A trivial difference can be highly significant in a huge sample; a clinically important difference can be non-significant in a small one.
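The last point, that significance tracks sample size as much as effect size, can be sketched with a two-proportion z-test. This is an illustrative normal-approximation calculation with invented counts; the function name is hypothetical and a real analysis would use a validated statistics package.

```python
import math

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical small study: 20% vs 10%, a large absolute difference.
p_small = two_proportion_p_value(10, 50, 5, 50)

# Hypothetical huge study: 10.5% vs 10.0%, a trivial absolute difference.
p_large = two_proportion_p_value(10_500, 100_000, 10_000, 100_000)

print(f"small study p={p_small:.3f}, large study p={p_large:.5f}")
```

Under these assumed counts the ten-percentage-point difference is not significant and the half-percentage-point difference is, which is exactly why the p-value cannot stand in for effect size.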
Confidence intervals
Prefer the confidence interval (CI) to the bare p-value. A 95% CI is built so that, on repeated sampling, 95% of such intervals would contain the true population value; pragmatically, read it as the plausible range for the true effect. Its width reflects precision (narrow = precise, usually from a large sample). The CI carries the significance test within it:
- For a difference (mean difference, risk difference), a 95% CI that crosses 0 is non-significant at p < 0.05.
- For a ratio (relative risk, odds ratio, hazard ratio), a 95% CI that crosses 1 is non-significant.
This single rule lets you read most results in a paper without the p-value at all.
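As a worked illustration of the "crosses 1" rule, the sketch below computes a relative risk with an approximate 95% CI on the log scale (the usual large-sample method). The trial counts are invented and the function name is hypothetical.

```python
import math

def relative_risk_ci(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Relative risk with an approximate 95% CI computed on the log scale."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    se_log_rr = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Same relative risk (0.6) in a hypothetical larger and smaller trial.
rr_big, lo_big, hi_big = relative_risk_ci(30, 200, 50, 200)
rr_small, lo_small, hi_small = relative_risk_ci(3, 20, 5, 20)

for label, lo, hi in [("larger trial", lo_big, hi_big), ("smaller trial", lo_small, hi_small)]:
    verdict = "excludes 1 (significant)" if hi < 1 or lo > 1 else "crosses 1 (non-significant)"
    print(f"{label}: RR 0.60, 95% CI {lo:.2f} to {hi:.2f}, {verdict}")
```

The point estimate is identical in both hypothetical trials; only the width of the interval, driven by sample size, decides whether it crosses 1.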
Type I and Type II error, power
Two ways to be wrong. A Type I (α) error is a false positive — rejecting a true null, "finding" an effect that is not real; α is conventionally set at 0.05. A Type II (β) error is a false negative — failing to detect a real effect, usually because the study is too small. Power = 1 − β is the probability of detecting an effect that genuinely exists; trials are typically powered to 80–90%. Underpowered studies are the commonest reason a real benefit is "not significant," which is why "no significant difference" must never be read as "no difference" — absence of evidence is not evidence of absence. Multiple testing inflates the Type I error: test twenty independent hypotheses at α = 0.05 and you expect one false positive by chance, the rationale for caution about unplanned subgroup analyses and for corrections such as Bonferroni.
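The multiple-testing arithmetic can be checked directly. This short sketch assumes the tests are independent, as in the text; the function names are hypothetical.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

alpha, n_tests = 0.05, 20
expected_false_positives = alpha * n_tests
fwer = familywise_error(alpha, n_tests)
corrected = bonferroni_threshold(alpha, n_tests)

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"chance of at least one false positive: {fwer:.0%}")
print(f"Bonferroni per-test threshold: {corrected}")
```

With twenty independent tests, one false positive is expected on average and the chance of at least one is well over half, which is why an isolated "significant" subgroup deserves suspicion.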
Measures of effect
| Measure | Definition | Used in |
|---|---|---|
| Relative risk (RR) | Risk in exposed ÷ risk in unexposed | Cohort studies, RCTs |
| Odds ratio (OR) | Odds in exposed ÷ odds in unexposed | Case-control studies, logistic regression |
| Absolute risk reduction (ARR) | Control event rate − treated event rate | RCTs |
| Number needed to treat (NNT) | 1 ÷ ARR | Translating trials to practice |
| Hazard ratio (HR) | Hazard (instantaneous event rate) in one group ÷ hazard in the comparison group | Survival / time-to-event |
Relative measures (RR, OR, HR) can exaggerate clinical importance. A "50% relative risk reduction" is dramatic if the baseline risk is 40% (ARR 20%, NNT 5) and almost meaningless if the baseline is 0.2% (ARR 0.1%, NNT 1000). Always look for the absolute numbers and the NNT. The OR approximates the RR only when the outcome is rare; for common outcomes the OR overstates the RR away from 1. The NNT's mirror image is the number needed to harm (NNH).
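The two baseline-risk scenarios above, and the divergence of OR from RR, can be reproduced in a few lines. This is a sketch with hypothetical function names; in practice the NNT is conventionally rounded up to a whole patient.

```python
def arr_and_nnt(baseline_risk, relative_risk_reduction):
    """Absolute risk reduction and number needed to treat for a given baseline risk."""
    treated_risk = baseline_risk * (1 - relative_risk_reduction)
    arr = baseline_risk - treated_risk
    return arr, 1 / arr

def odds_ratio(risk_exposed, risk_unexposed):
    """Odds ratio computed from two risks (proportions)."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed

# The same 50% relative risk reduction at two baseline risks.
arr_high, nnt_high = arr_and_nnt(0.40, 0.5)
arr_low, nnt_low = arr_and_nnt(0.002, 0.5)
print(f"baseline 40%: ARR {arr_high:.1%}, NNT {nnt_high:.0f}")
print(f"baseline 0.2%: ARR {arr_low:.1%}, NNT {nnt_low:.0f}")

# RR is 0.5 in both cases; the OR matches it only when the outcome is rare.
or_common = odds_ratio(0.20, 0.40)
or_rare = odds_ratio(0.001, 0.002)
print(f"common outcome: RR 0.50, OR {or_common:.3f}")
print(f"rare outcome:   RR 0.50, OR {or_rare:.3f}")
```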
Survival analysis
Time-to-event outcomes (recurrence, death) that include patients with incomplete follow-up are analysed with Kaplan–Meier curves, where steps mark events and tick marks denote censored patients (lost to follow-up or event-free at study end). Curves are compared with the log-rank test, and the Cox proportional-hazards model gives an adjusted hazard ratio. This is the language of gynaecological oncology outcome data.
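To show how censoring enters the Kaplan–Meier estimate, here is a minimal product-limit sketch in plain Python. The follow-up times are invented, the function name is hypothetical, and real analyses would use a dedicated survival package with confidence bands.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is True if the event occurred at times[i], False if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        n_censored = sum(1 for ti, e in zip(times, events) if ti == t and not e)
        if n_events:
            # Each event time multiplies survival by the fraction surviving it.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set afterwards.
        at_risk -= n_events + n_censored
    return curve

# Hypothetical follow-up in months; False marks a censored patient.
months = [3, 5, 5, 8, 12, 12]
had_event = [True, False, True, True, False, False]

curve = kaplan_meier(months, had_event)
for t, s in curve:
    print(f"month {t}: estimated survival {s:.2f}")
```

Censored patients never cause a step down, but they shrink the denominator for later steps, so each subsequent event produces a larger drop.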
Figure G1.1 — Core inference toolkit for reading study results: data type, p-values, 95% confidence intervals, error, power and effect-size translation.
Assessment
Critically appraising a study — the diagnostic test 2×2
For a diagnostic or screening test, build the 2×2 against the reference standard:
| | Disease + | Disease − |
|---|---|---|
| Test + | True positive (TP) | False positive (FP) |
| Test − | False negative (FN) | True negative (TN) |
- Sensitivity = TP/(TP+FN) — proportion of diseased correctly identified. A highly sensitive test, when negative, rules out disease (SnNOUT).
- Specificity = TN/(TN+FP) — proportion of healthy correctly cleared. A highly specific test, when positive, rules in disease (SpPIN).
- Positive predictive value (PPV) = TP/(TP+FP) — of those who test positive, the proportion truly diseased.
- Negative predictive value (NPV) = TN/(TN+FN) — of those who test negative, the proportion truly healthy.
The crucial teaching point for the exam: sensitivity and specificity are intrinsic to the test and stable across populations, but PPV and NPV depend on prevalence (pre-test probability). The same test deployed in a low-prevalence screening population (e.g. asymptomatic HPV screening) yields many false positives and a low PPV; in a high-prevalence referral clinic the PPV rises. This is why a positive screening result is not a diagnosis and why confirmatory testing matters.
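The prevalence effect is worth computing once by hand. The sketch below applies the definitions above to a hypothetical test (sensitivity 95%, specificity 90%; both values are assumptions for illustration, not the performance of any real assay) at two prevalences.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV for a test applied at a given prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

# Same hypothetical test, two settings.
ppv_screening, npv_screening = predictive_values(0.95, 0.90, prevalence=0.01)
ppv_referral, npv_referral = predictive_values(0.95, 0.90, prevalence=0.30)

print(f"screening (1% prevalence): PPV {ppv_screening:.1%}, NPV {npv_screening:.2%}")
print(f"referral (30% prevalence): PPV {ppv_referral:.1%}, NPV {npv_referral:.1%}")
```

Under these assumed figures, fewer than one in ten positives in the screening setting is a true positive, while in the referral setting about four in five are, with no change in the test itself.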
Likelihood ratios (LRs) combine these and, unlike predictive values, are prevalence-independent. LR+ = sensitivity/(1−specificity); LR− = (1−sensitivity)/specificity. An LR+ > 10 or LR− < 0.1 substantially shifts probability; LRs feed Bayesian post-test probability and are the most portable summary of a test's performance. The ROC curve plots sensitivity against (1−specificity) across thresholds; the area under the curve (AUC) summarises discrimination (0.5 = useless, 1.0 = perfect).
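The Bayesian step from pre-test to post-test probability runs through odds. A minimal sketch, again for a hypothetical test with sensitivity 95% and specificity 90% (assumed values, hypothetical function names):

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the LR, convert back to probability."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

lr_pos, lr_neg = likelihood_ratios(0.95, 0.90)

# Pre-test probability of 1% (screening) versus 30% (referral clinic).
after_positive_low = post_test_probability(0.01, lr_pos)
after_positive_high = post_test_probability(0.30, lr_pos)
after_negative_high = post_test_probability(0.30, lr_neg)

print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.3f}")
print(f"positive test, 1% pre-test:  {after_positive_low:.1%}")
print(f"positive test, 30% pre-test: {after_positive_high:.1%}")
print(f"negative test, 30% pre-test: {after_negative_high:.1%}")
```

The post-test probability after a positive result equals the PPV at that prevalence, which shows that LRs and predictive values are two routes to the same answer; the LR is simply the part that travels between populations.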

Figure G1.2 — Diagnostic-test map linking the 2×2 table to sensitivity, specificity, predictive values, likelihood ratios and ROC curves.
Hierarchy of evidence and study designs
Rank designs by their resistance to bias:
- Systematic review / meta-analysis of RCTs (highest).
- Randomised controlled trial (RCT) — randomisation balances known and unknown confounders; blinding and allocation concealment guard against bias.
- Cohort study — prospective, follows exposed vs unexposed forward; gives incidence and RR.
- Case-control study — retrospective, compares diseased (cases) with non-diseased (controls) for prior exposure; efficient for rare diseases; gives the OR.
- Cross-sectional study — snapshot, gives prevalence.
- Case series / case report / expert opinion (lowest).
This maps onto formal levels of evidence and grades of recommendation (e.g. GRADE), which underpin how the guidelines you cite were built (standard teaching).
Bias, confounding and validity
Bias is systematic error and cannot be fixed by enlarging the sample. Know selection bias (non-representative sampling; e.g. recruiting only tertiary-hospital patients), information/measurement bias (including recall bias, endemic to case-control studies, and observer bias, countered by blinding), and lead-time / length-time bias (which flatter screening programmes). Confounding is a distortion by a third variable associated with both exposure and outcome (the classic: coffee "causes" cancer when smoking confounds). Confounding is addressed by randomisation (the only method that also balances unknown confounders), restriction, matching, stratification, and multivariable regression (linear for continuous outcomes, logistic for binary, Cox for time-to-event). Internal validity is freedom from bias within the study; external validity (generalisability) asks whether the result applies to your patients — a pre-eclampsia trial in well-resourced Europe may not transfer cleanly to a South African district hospital.
Meta-analysis and the forest plot
A meta-analysis pools comparable studies for a single weighted estimate (larger, lower-variance studies weighted more). On the forest plot, each study is a box (size ∝ weight) with its CI whiskers; the pooled estimate is the diamond at the bottom, whose width is its CI and whose position relative to the line of no effect (1 for ratios) gives significance. Heterogeneity — genuine variation in effect between studies — is quantified by the I² statistic (roughly: >50% is substantial; standard teaching); high heterogeneity argues for a random-effects model and caution in pooling. Publication bias (positive trials more likely published) is screened with a funnel plot.
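The weighting and heterogeneity ideas can be sketched with a fixed-effect, inverse-variance pool of log relative risks. The three studies below are invented, the function name is hypothetical, and a real meta-analysis would use dedicated software and consider a random-effects model.

```python
import math

def pool_fixed_effect(log_effects, standard_errors):
    """Inverse-variance fixed-effect pooled estimate, its SE, and I² (%)."""
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    # Cochran's Q measures how far studies scatter around the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))
    df = len(log_effects) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, i_squared

# Hypothetical studies: relative risks with the SE of each log(RR).
relative_risks = [0.7, 0.8, 0.6]
se_log_rr = [0.2, 0.1, 0.3]

pooled_log, pooled_se, i2 = pool_fixed_effect(
    [math.log(rr) for rr in relative_risks], se_log_rr
)
pooled_rr = math.exp(pooled_log)
ci_lower = math.exp(pooled_log - 1.96 * pooled_se)
ci_upper = math.exp(pooled_log + 1.96 * pooled_se)

print(f"pooled RR {pooled_rr:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f}), I² {i2:.0f}%")
```

The study with the smallest standard error carries the most weight, so the pooled estimate sits closest to it. That is the numerical meaning of the largest box on the forest plot.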
Management
"Management" here is the disciplined application of statistics to clinical decisions and to your own MMed research.
Applying numbers at the bedside
- Translate relative to absolute before counselling. Express benefit as ARR and NNT, and harm as NNH, in the patient's own baseline-risk terms. Telling a woman a drug "halves her risk" is meaningless without her starting risk.
- Read the CI, not just the p-value. Ask whether the whole plausible range (the CI) is clinically worthwhile, and whether it crosses the threshold of no effect.
- Check external validity before transplanting a result into South African practice — population, level of care, comorbidity (notably HIV prevalence), and resource setting. A diagnostic test's PPV recomputed for local prevalence may change the management algorithm entirely.
- Distinguish statistical from clinical significance in both directions: a significant-but-tiny effect may not warrant changing practice; a non-significant trend in an underpowered study may still merit a larger trial.
Designing your MMed research
The FCOG(SA)/MMed pathway requires a research component, so the registrar must be able to build a sound study, not merely read one.
- Frame an answerable question (PICO: Population, Intervention/exposure, Comparator, Outcome) and a clear primary outcome — pre-specified, to avoid outcome-switching.
- Choose the design that fits the question and is feasible: a suspected harm cannot ethically be randomised, and a rare disease is usually studied more efficiently with a case-control design.
- Calculate sample size a priori from the expected effect size, chosen α (0.05), desired power (80–90%), and outcome variance. An underpowered study is unethical: it exposes participants to risk without the ability to answer the question.
- Plan the analysis before collecting data: pick the test by data type (t-test, or its non-parametric counterpart the Mann–Whitney U test, for two groups; ANOVA or Kruskal–Wallis for more than two; χ² or Fisher's exact for categorical data; paired tests for paired data; correlation or regression for relationships). Pre-specify subgroups.
- Minimise bias by design — randomise, conceal allocation, blind where possible, define variables objectively, and plan for missing data and loss to follow-up.
- Obtain ethics approval — every study on human participants needs Health Research Ethics Committee clearance, aligned with the National Health Act 61 of 2003 and the POPIA framework governing patient data (see informed-consent and sa-og-law). This is non-negotiable in South Africa.
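The a priori sample-size step in the list above can be sketched for the common case of comparing two proportions. This uses the standard normal-approximation formula without continuity correction; the event rates are invented, the function name is hypothetical, and a real protocol should have its calculation checked by a statistician and inflated for expected loss to follow-up.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p_control, p_treated, alpha=0.05, power=0.80):
    """Participants needed per group to detect p_control vs p_treated (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2
    return math.ceil(n)

# Hypothetical primary outcome: 20% in controls, hoped-for 10% with treatment.
n_power_80 = sample_size_two_proportions(0.20, 0.10, power=0.80)
n_power_90 = sample_size_two_proportions(0.20, 0.10, power=0.90)

# A smaller expected effect (20% vs 15%) needs a much larger study.
n_small_effect = sample_size_two_proportions(0.20, 0.15, power=0.80)

print(f"per group: {n_power_80} at 80% power, {n_power_90} at 90% power")
print(f"per group for the smaller effect: {n_small_effect}")
```

Because the required number scales with the inverse square of the difference, halving the expected effect roughly quadruples the sample size, which is usually the deciding factor in MMed feasibility.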

Figure G1.3 — Evidence-to-action ladder for appraising study design, bias, forest plots and the MMed research workflow.
Statistics in South African obstetric governance
Statistics is not abstract in South African practice — it is the method of our national audit. The Saving Mothers report (NCCEMD) is a triennial epidemiological analysis of maternal deaths that computes the institutional Maternal Mortality Ratio (iMMR) — deaths per 100,000 live births — and ranks causes (obstetric haemorrhage, hypertension, and non-pregnancy-related infection including HIV are leading contributors). These rates, with their numerators (deaths) and denominators (live births), are descriptive epidemiology driving the priorities in the NDoH National Integrated Maternal and Perinatal Care Guideline (NDoH, 2024). Perinatal audit (the PPIP system) computes stillbirth, early-neonatal-death and perinatal-mortality rates. To engage with morbidity and mortality meetings, to interpret your own facility's iMMR against the national figure, and to know whether a change after an intervention is real or random variation, you need exactly the literacy in this chapter.
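For a facility-level audit, the iMMR arithmetic and a rough sense of its random variation can be sketched as follows. The counts are invented, the function name is hypothetical, and the interval uses a simple normal approximation to the Poisson count of deaths, which is crude when deaths are few.

```python
import math

def immr_with_ci(maternal_deaths, live_births, z=1.96):
    """iMMR per 100,000 live births with an approximate 95% CI.

    Assumption: deaths are treated as a Poisson count, so SE = sqrt(deaths).
    """
    scale = 100_000 / live_births
    immr = maternal_deaths * scale
    margin = z * math.sqrt(maternal_deaths)
    lower = max(0.0, maternal_deaths - margin) * scale
    upper = (maternal_deaths + margin) * scale
    return immr, lower, upper

# Hypothetical facility: 12 maternal deaths among 9,500 live births.
immr, immr_lower, immr_upper = immr_with_ci(12, 9_500)
print(f"iMMR {immr:.0f} per 100,000 live births "
      f"(approximate 95% CI {immr_lower:.0f} to {immr_upper:.0f})")
```

With so few deaths the interval is very wide, so a year-on-year change in a single facility's iMMR is often compatible with chance alone.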
Red flags / pitfalls
- Reading a p-value as effect size. "p < 0.001" says the result is unlikely under the null, not that the effect is large or important. Always demand the effect size and its CI.
- "No significant difference" = "no difference." Almost always a power problem. Check the sample size and the CI width before believing a negative trial.
- Quoting relative risk reduction without the absolute risk. The single most common way trial results (and pharmaceutical marketing) mislead. Convert to ARR and NNT.
- Confusing correlation with causation. Observational association is hypothesis-generating; only design (ideally randomisation) and Bradford Hill–type reasoning support causal claims.
- Confusing SD with SEM, or OR with RR when the outcome is common (the OR then overstates the effect).
- Ignoring prevalence when interpreting a positive test. In a low-prevalence population a positive screening result can be a false positive more often than true disease, the failure behind unnecessary intervention and patient anxiety.
- Trusting unplanned subgroup analyses. Test enough subgroups and one will be "significant" by chance (multiplicity). Pre-specification and interaction testing are the safeguards.
- Per-protocol instead of intention-to-treat (ITT) in an RCT. ITT analyses participants in the group they were randomised to, preserving the benefit of randomisation; per-protocol analysis (only the compliant) reintroduces selection bias and usually inflates the apparent effect.
- Assuming external validity. A trial result from a high-resource setting may not hold in a South African district hospital with different prevalence, comorbidity and resources — check before you transplant the guideline.
- Misreading a forest plot. The diamond is the pooled estimate; if it touches or crosses the line of no effect (1 for ratios, 0 for differences) the pooled result is non-significant. High I² means the studies disagree, so pool with caution.
Evidence anchors
Biomedical statistics is textbook canon rather than guideline-bound — the definitions of sensitivity, specificity, PPV/NPV, likelihood ratios, RR/OR/NNT, p-values and confidence intervals, types of error, study designs, and meta-analysis are standard teaching and are listed as such for Domain G in docs/VERIFIED-SOURCES.md. The chapter therefore states these as standard teaching rather than attaching guideline citations to arithmetic.
Where statistics meets South African obstetric practice, the verified anchors are:
- National Integrated Maternal and Perinatal Care Guideline (NDoH, 2024) — the SA obstetric source of truth, itself built on national audit data.
- Saving Mothers report (NCCEMD) — triennial maternal-death audit; the institutional Maternal Mortality Ratio and cause-ranking that exemplify applied descriptive epidemiology (obstetric haemorrhage, hypertension and non-pregnancy-related infection/HIV as leading causes).
- National Health Act 61 of 2003 — the statutory basis for research ethics approval and informed consent in any South African study (Domain H reference).
International evidence cited elsewhere in this textbook demonstrates these statistical concepts in action — for example the relative-versus-absolute effect sizes and confidence intervals reported in the WOMAN trial of tranexamic acid for postpartum haemorrhage and the prophylactic-aspirin evidence summarised in NICE NG133 (Hypertension in pregnancy, 2019) — and are anchored fully in their own chapters.
