Comprehensive model of human disease genetics based on the proteome

November 24, 2025

Statistics and Reproducibility

This study analyzes large-scale sequencing and variant annotation datasets, drawing from sources like UniRef58, UKBB17, gnomAD18, ClinVar59, and various diagnostic cohorts. Notably, no statistical methods were applied to pre-determine sample size; all available data from each cohort were incorporated unless stated otherwise in the methods section. Importantly, the experiments were not randomized, and investigators remained unblinded during experiments and outcome assessments. To evaluate reproducibility, the analysis included benchmarking across various independent datasets and comparing findings with previously published models. Public access to all code and trained models allows for reproduction of the analyses.

Data Acquisition

Multiple Sequence Alignments

Using established protocols, the EVCouplings pipeline was employed to create multiple sequence alignments (MSAs) from sequences in the UniRef100 database, sourced in March 2022.

Human Variation Data

Variants from the UKBB’s 500k release were annotated using VEP GRCh38 RefSeq and a custom annotation to optimize variant inclusion for existing models. Variants underwent quality filtering, and annotations were matched to ensure consistency between the reference and transcript sequences. In analyzing variants outside of training, genes with less than 95% coverage across UKBB participants were excluded.

Diagnostic Cohorts

All cohorts involved obtained informed consent, adhering to ethical guidelines relevant to their respective institutions. The SDD metacohort sourced de novo mutations from a combination of studies, totaling 31,058 subjects, and quality filtering was managed by the associated centers.

Clinical Variants

To evaluate predictive performance, we utilized clinically labeled variants from ClinVar, focusing on datasets from 2019 and 2020 for comprehensive analysis.

Model Building

The methodological aim was to rank the severity of genetic variant effects within an individual’s proteome. The constructed probabilistic model was trained on protein sequence data from both diverse species and human populations, allowing a fusion of evolutionary and population-specific insights. This model intends to ascribe accurate impacts of variants at a single-residue resolution.

Modeling Individual Proteins

Recent advancements indicate that unsupervised models trained on protein sequences from various species successfully differentiate between benign and pathogenic variants, performing comparably to functional assays. Two model types were utilized: an alignment-based model, trained on gene-specific MSAs, and an alignment-free model derived from a full protein database.

Bayesian Variational Autoencoder

Variational autoencoders (VAEs) effectively capture complex distributions. The VAE we employed captures the conditional probability of amino acid presence at specified positions based on hidden variables. The scoring mechanism utilized the evidence lower bound (ELBO) to derive the fitness score.

Sequence Reweighting

In light of the limitations posed by phylogenetic and ascertainment biases, sequence reweighting was applied to correct for distribution irregularities. A reweighting formula adjusted sequence importance based on their alignment distance in corresponding MSAs.

Masked Language Model

The model training also incorporated ESM-1v, a language model leveraging a self-supervised masking process, enhancing its predictive capabilities for amino acid substitutions.

Estimating Variant-Level Constraint

While the individual gene models rank variants effectively, they struggle to compare variants across different genes. To tackle this, popEVE was introduced as the pioneering model for proteome-wide assessment of missense variant impacts.

Performance Assessments

Within-Protein Comparison

We assessed our model’s proficiency in ranking pathogenic variants by correlating model predictions with deep mutational scans, as well as evaluating its accuracy in predicting benign and pathogenic labels from ClinVar.

Cross-Protein Comparison

To ascertain the model’s ability to distinguish between various severity levels in clinical variants, a dataset of high-impact variants was created. This involved setting thresholds to differentiate benign from pathogenic classifications effectively.

Analysis of Patient Data

Two strategic approaches were taken to analyze de novo mutations from the metacohort, aiming to provide insights for unresolved genetic cases. Each approach, including burden testing and direct case-variant associations, was formulated to evaluate the significance of specific variants.

CHOOSE LANGUAGE

SELECT LANGUAGE BELOW

Comprehensive model of human disease genetics based on the proteome

Statistics and Reproducibility

Data Acquisition

Multiple Sequence Alignments

Human Variation Data

Diagnostic Cohorts

Clinical Variants

Model Building

Modeling Individual Proteins

Bayesian Variational Autoencoder

Sequence Reweighting

Masked Language Model

Estimating Variant-Level Constraint

Performance Assessments

Within-Protein Comparison

Cross-Protein Comparison

Analysis of Patient Data

Related News

Andy Ogles aims to address ‘chain migration’ in comprehensive legal immigration reform

Murder suspect in Charlie Kirk case aims to limit media access at court hearing

A new tax credit won’t solve what Sunday schools once taught.

Scott Bessent brought into Situation Room during interview — indicates US Navy may escort oil tankers through Strait of Hormuz

SELECT LANGUAGE BELOW

Comprehensive model of human disease genetics based on the proteome

Statistics and Reproducibility

Data Acquisition

Multiple Sequence Alignments

Human Variation Data

Diagnostic Cohorts

Clinical Variants

Model Building

Modeling Individual Proteins

Bayesian Variational Autoencoder

Sequence Reweighting

Masked Language Model

Estimating Variant-Level Constraint

Performance Assessments

Within-Protein Comparison

Cross-Protein Comparison

Analysis of Patient Data

Related News

Expert bartenders share their tips

Andy Ogles aims to address ‘chain migration’ in comprehensive legal immigration reform

Murder suspect in Charlie Kirk case aims to limit media access at court hearing

A new tax credit won’t solve what Sunday schools once taught.

Scott Bessent brought into Situation Room during interview — indicates US Navy may escort oil tankers through Strait of Hormuz

AI leader Emil Michael describes Anthropic as ‘wild’