Unique genetic structure in the tails of intricate traits

May 28, 2026

Overview of UK Biobank Genotype Data

The UK Biobank (UKB) is a prospective cohort study that enrolled around 500,000 participants across the UK between 2006 and 2010. Phenotype data on anthropometric, biological, and lifestyle measures were gathered at baseline and at follow-up visits, and are linked to health and disease records. The genetic component includes 488,377 samples genotyped at 805,426 SNPs. To assign population ancestries, a 4-means clustering analysis was run on the first two principal components of the genotype data. Clusters were labeled mainly by the countries of birth of their participants, resulting in 461,931 individuals of European descent, 11,074 South Asians, 7,935 Africans, 2,585 West Asians, and 2,550 East Asians, along with 1,619 individuals in clusters without a majority country of birth.
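
The clustering and labeling step can be illustrated with a minimal sketch. It assumes each participant is reduced to a (PC1, PC2) pair and that a cluster takes the name of a country of birth only when that country accounts for more than half of its members. The function names, the random initialization, and the 0.5 share are illustrative assumptions and are not taken from the study.

```python
import random
from collections import Counter


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on a list of (pc1, pc2) tuples.

    Returns (centroids, labels). Initial centroids are k randomly chosen
    points, which is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def majority_label(labels, birth_countries, k, min_share=0.5):
    """Name each cluster after its majority country of birth, else None."""
    names = {}
    for c in range(k):
        counts = Counter(b for lab, b in zip(labels, birth_countries) if lab == c)
        if not counts:
            names[c] = None
            continue
        country, n = counts.most_common(1)[0]
        names[c] = country if n / sum(counts.values()) > min_share else None
    return names
```

Calling `kmeans(pcs, 4)` followed by `majority_label(labels, births, 4)` would give one label per cluster, with `None` marking clusters that have no majority country of birth.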

After clustering, standard quality control (QC) steps were conducted for each ancestry cluster. SNPs with a minor allele frequency (MAF) under 0.02 or those failing the Hardy–Weinberg equilibrium test (P < 10⁻⁸) were excluded. Samples with substantial missing data or inconsistencies between reported and genetic sex were also removed. To maximize sample size for general population analyses, a greedy algorithm removed related individuals while retaining as many unrelated ones as possible. After QC, 411,948 unrelated individuals remained, of whom 387,472 were of European ancestry and the other 24,476 were of diverse non-European ancestries. Within the European ancestry cluster, 18,340 individuals with repeated measures were set aside, leaving up to 369,132 unrelated European ancestry individuals for the primary analysis.
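
The text states only that a greedy algorithm was used to remove relatives. One common heuristic, sketched here as an assumption and not as the study's actual procedure, repeatedly drops the individual with the most remaining relatives until no related pair is left. The function name and the tie-breaking rule are hypothetical.

```python
def greedy_unrelated(samples, related_pairs):
    """Return a set of samples in which no two members are a related pair.

    Heuristic: repeatedly remove the sample with the most remaining
    relatives. Ties are broken by sample ID for reproducibility.
    """
    relatives = {s: set() for s in samples}
    for a, b in related_pairs:
        relatives[a].add(b)
        relatives[b].add(a)

    kept = set(samples)
    while kept:
        # Pick the sample that is currently related to the most others.
        worst = max(kept, key=lambda s: (len(relatives[s]), str(s)))
        if not relatives[worst]:
            break  # nobody left has a relative in the kept set
        kept.remove(worst)
        for other in relatives.pop(worst):
            relatives[other].discard(worst)
    return kept


# Toy example: A is related to B and C; D is related to E.
unrelated = greedy_unrelated(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("A", "C"), ("D", "E")],
)
```

In the toy example, removing A frees both B and C, so three of the five samples are retained, whereas removing B and C first would retain fewer.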

When examining siblings, pairs were initially found based on kinship coefficients indicative of first-degree relatives. Only European ancestry individuals were included here, mainly due to insufficient statistical power in the multi-ancestry sample. A specific threshold was used to differentiate sibling pairs from parent–offspring pairs. Thus, 17,289 sibling pairs were ultimately selected for analysis.
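
The thresholds used to separate sibling pairs from parent-offspring pairs are not given in the text. The sketch below assumes the widely used KING kinship range for first-degree relatives (roughly 0.177 to 0.354) and an IBS0 cutoff of 0.0012. It relies on the fact that a parent and child always share at least one allele at every locus, so their proportion of loci with no shared alleles (IBS0) is close to zero, while full siblings have a clearly positive IBS0. All default values and names here are illustrative assumptions.

```python
def classify_first_degree(kinship, ibs0, kin_lo=0.177, kin_hi=0.354, ibs0_min=0.0012):
    """Classify a pair as 'sibling', 'parent-offspring', or 'other'.

    kinship: estimated kinship coefficient for the pair.
    ibs0: fraction of markers at which the pair shares no alleles.
    The default thresholds are assumptions, not values from the study.
    """
    if not (kin_lo <= kinship < kin_hi):
        return "other"  # not in the first-degree kinship range
    # Parent-offspring pairs should have IBS0 near zero.
    return "sibling" if ibs0 >= ibs0_min else "parent-offspring"


pairs = [
    ("p1", "p2", 0.25, 0.0150),   # sibling-like
    ("p3", "p4", 0.25, 0.0001),   # parent-offspring-like
    ("p5", "p6", 0.06, 0.0300),   # more distant relatives
]
sibling_pairs = [
    (a, b) for a, b, kin, ibs0 in pairs
    if classify_first_degree(kin, ibs0) == "sibling"
]
```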

Initial Trait Selection

We began with a list of 408 non-procedural continuous traits drawn from UKB categories such as biomarkers, physical measures, and cognitive function. Several traits related to reproductive success were discarded to prevent circular reasoning, as were predicted traits that had measured alternatives. Additionally, individuals with cancer, pregnant individuals, and those on specific medications were removed to reduce the influence of non-genetic variance. Extreme outlier samples were also excluded. This left 141 quantitative traits for further QC.

Trait Residualization

To lessen the impact of environmental risk factors, trait values were adjusted within each ancestry group by linear regression on covariates such as age, sex, and lifestyle factors. Outliers in the larger European ancestry group were removed during this step. The adjusted traits were then standardized for subsequent analyses using a rank-based inverse normal transformation. Although this transformation is conservative, it was needed to keep the test statistics from the linear regression models stable.
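
The two steps described here, residualization followed by a rank-based inverse normal transformation, can be sketched as follows. For brevity the regression uses a single covariate, whereas the study adjusted for several. The offset `c = 0.5` in the rank-to-quantile mapping is one common choice and is an assumption, as are the function names.

```python
from statistics import NormalDist


def residualize(y, x):
    """Residuals of y after simple least-squares regression on one covariate x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return [yi - mean_y - slope * (xi - mean_x) for xi, yi in zip(x, y)]


def rank_inverse_normal(values, c=0.5):
    """Map values to normal quantiles via their ranks (ties get the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]
```

A typical use is `rank_inverse_normal(residualize(trait, age))`, which yields values that are approximately standard normal whatever the shape of the original trait distribution.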

Genome-Wide Association Studies (GWAS) and Further QC

For the initial analyses, European ancestry samples were divided into base and target datasets. A GWAS of each trait was run in the base dataset using PLINK 1.9, and heritability and genetic correlations were estimated with LD score regression. Traits with low heritability, or with individual SNPs that explained a large share of trait variance, were excluded so that only polygenic traits remained. After this QC, 114 quantitative traits were retained across various categories; these were further refined to a set of 74 traits for the primary analyses, chosen to give balanced representation across phenotypes.

PRS Calculation

The final 74 traits ranged from biomarkers to cognitive function. Polygenic risk scores (PRS) were calculated with PRSice-2 for individuals in the target dataset. This was done to optimize the sample for testing in the extremes of the trait distributions. The resulting PRS were used in the subsequent POPout analyses, with care taken to minimize overfitting.
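
At its core, a polygenic score is a weighted sum of allele dosages, with weights taken from the base-dataset GWAS. The sketch below shows only that core idea with a p-value threshold. It omits the clumping and other processing that dedicated software such as PRSice-2 performs, and the function name, arguments, and threshold are illustrative assumptions.

```python
def polygenic_score(dosages, effects, pvalues, p_threshold=1.0):
    """Sum of effect-size-weighted dosages over SNPs passing a p-value threshold.

    dosages: per-SNP effect-allele counts (0, 1, 2, or None if missing)
             for one individual in the target dataset.
    effects: per-SNP effect sizes estimated in the base dataset.
    pvalues: per-SNP GWAS p-values from the base dataset.
    """
    score = 0.0
    for dosage, beta, p in zip(dosages, effects, pvalues):
        if dosage is None or p > p_threshold:
            continue  # skip missing genotypes and SNPs above the threshold
        score += beta * dosage
    return score


# One individual, three SNPs; only SNPs with p <= 0.05 contribute.
example_score = polygenic_score(
    dosages=[0, 1, 2],
    effects=[0.1, -0.2, 0.3],
    pvalues=[0.01, 0.5, 1e-8],
    p_threshold=0.05,
)
```

Keeping the base and target datasets separate, as described above, means the weights are estimated in individuals other than those being scored, which is one guard against overfitting.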

Replication Datasets

European Ancestry Repeated Measures

Following the initial QC, up to 18,340 individuals of European ancestry had repeated measures for 63 of the traits. Averaging the repeated values for each individual gave a more precise target subset by damping transient effects that could skew results.

Multi-ancestry Sample

Post-QC, there were 24,476 individuals of varied ancestries available for replication, normalized alongside the European samples for consistency in PRS calculations.

All of Us Cohort

The All of Us program is a diverse biobank in the U.S. with a broader participant age range. We identified 31 overlapping traits for replication, ensuring trait descriptions were similar to those in our UKB analysis. Trait QC for All of Us mirrored that of UKB as much as possible, with adjustments made for medications and outliers.

POPout Analyses

The POPout test was designed to detect deviations from the expected common-variant architecture. Environmental influences were first regressed out of the trait values to yield cleaner data for testing. The test then compared PRS against trait values to check for systematic overestimation at the extreme ends of the trait distributions. Two-sided tests were applied to examine deviations from the expected distributions, with a particular focus on the tails.
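
The exact POPout statistic is not specified in this section, so the following is only a simplified stand-in for the general idea. It fits a linear regression of PRS on trait value across the whole sample, then asks whether individuals in one tail of the trait distribution have PRS that differ on average from the regression expectation, using a two-sided z-test. The function name, the tail fraction, and the normal approximation are all assumptions.

```python
import math
from statistics import NormalDist, stdev


def tail_deviation_test(trait, prs, tail_fraction=0.01, upper=True):
    """Two-sided test of whether tail PRS deviates from the linear expectation.

    Returns (mean_tail_residual, z, p_value). A negative mean residual in the
    upper tail means PRS is lower there than the overall trend predicts.
    """
    n = len(trait)
    mean_t = sum(trait) / n
    mean_p = sum(prs) / n
    stt = sum((t - mean_t) ** 2 for t in trait)
    slope = sum((t - mean_t) * (p - mean_p) for t, p in zip(trait, prs)) / stt
    # Residual PRS: observed minus the value expected from the trait.
    resid = [p - mean_p - slope * (t - mean_t) for t, p in zip(trait, prs)]
    sd_resid = stdev(resid)

    # Indices of the individuals in the chosen tail of the trait distribution.
    k = max(2, int(round(tail_fraction * n)))
    order = sorted(range(n), key=lambda i: trait[i], reverse=upper)
    tail = order[:k]

    mean_tail = sum(resid[i] for i in tail) / k
    z = mean_tail / (sd_resid / math.sqrt(k))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return mean_tail, z, p_value
```

When the trait and PRS are related linearly throughout the distribution, the tail residuals should average close to zero. A marked departure in a tail is the kind of signal such a test is meant to flag.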

Sibling-Based Analyses

We used a sibling-based approach to detect deviations due to large-effect alleles, examining how trait values are distributed among siblings. This method follows established theoretical frameworks for identifying departures from common-variant architecture, and considers both Mendelian and de novo influences.

Conclusion

This comprehensive study of UK Biobank genotype data highlights significant advancements in understanding genetic influences on various traits. By bringing together large datasets and applying rigorous QC and analytical frameworks, the findings could inform future genetic research and applications.