Identification of 5 HF subtypes using a machine learning approach
Identifying subtypes of heart failure from three electronic health record sources with machine learning: an external, prognostic, and genetic validation study
Introduction and methods
Current heart failure (HF) subtype classifications have not resulted in precision medicine, personalized care, or targeted therapies [1-6]. Moreover, incomplete knowledge of HF subtypes across the wide spectrum of causal factors and populations has limited primary prevention and screening guidelines for this disease [7,8].
Aim of the study
In a large population of patients with incident HF, the authors used machine learning to (1) identify subtypes with clinical relevance throughout the HF disease course, and low risk of bias for patient selection and algorithms; (2) demonstrate internal, external, prognostic, and genetic validity; and (3) develop potential clinical pathways to improve impact.
In this external, prognostic, and genetic validation study, the authors used their 2021 framework for practical machine learning implementation, consisting of 6 stages: clinical relevance, patients, algorithm, internal validation (within dataset and across methods), external validation (across datasets), clinical utility, and effectiveness. Data on patients aged ≥30 years with incident HF were extracted from 2 population-based electronic health record databases in the UK, the Clinical Practice Research Datalink (CPRD; n=188,800) and The Health Improvement Network (THIN; n=124,262), covering 1998 to 2018.
The CPRD and THIN datasets yielded 645 factors before and after HF diagnosis, including demographic information, comorbidities, and medication use and persistence. For the algorithm, 87 of these 645 factors were selected. To reduce the risk of algorithmic bias, the following 4 unsupervised machine learning methods were compared: K-means, hierarchical, K-medoids, and mixture modeling.
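As a rough illustration of comparing clustering methods on the same data, the sketch below runs a small K-means and a simple alternating K-medoids on synthetic 2D points and measures how often the two partitions agree. The toy data, the function names, and the use of the Rand index as the agreement measure are all assumptions made for illustration; the summary does not state how the authors compared methods, and hierarchical and mixture-model clustering are left out for brevity.

```python
import math
import random
from itertools import combinations


def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def k_means(points, k, rng, n_iter=100):
    """Lloyd's algorithm: each center is the mean of its members."""
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster happens to be empty.
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else centers[j]
            )
        if updated == centers:
            break
        centers = updated
    return [nearest(p, centers) for p in points]


def k_medoids(points, k, rng, n_iter=100):
    """Alternating K-medoids: each center is an actual data point."""
    medoids = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [nearest(p, medoids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # The medoid minimizes total distance to the other members.
            updated.append(
                min(members, key=lambda m: sum(math.dist(m, q) for q in members))
                if members else medoids[j]
            )
        if updated == medoids:
            break
        medoids = updated
    return [nearest(p, medoids) for p in points]


def rand_index(a, b):
    """Share of point pairs that both partitions treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Synthetic data only: two loose groups of points, not patient data.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
points += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(30)]

km_labels = k_means(points, 2, rng)
kmed_labels = k_medoids(points, 2, rng)
print("agreement (Rand index):", rand_index(km_labels, kmed_labels))
```

The Rand index ignores how clusters are numbered, so two methods that find the same groups under different labels still score 1.0. Both algorithms start from random points and can settle in a local optimum, so in practice several restarts would be compared.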
Subtypes were identified and evaluated for: (1) external validity; (2) prognostic validity (predictive accuracy for 1-year all-cause mortality); and (3) genetic validity (associations with single nucleotide polymorphisms [SNPs] and polygenic risk scores [PRSs] for HF-related traits, using UK Biobank data [n=9573]).
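The validation results in this summary are reported as c-statistics. As a minimal sketch of what that number measures, the function below computes the share of (event, non-event) pairs in which the event case received the higher predicted score, counting ties as half. The function name and the toy scores are illustrative assumptions; the summary does not describe the authors' exact calculation.

```python
def c_statistic(scores, outcomes):
    """Concordance between predicted scores and binary outcomes (1 = event)."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0   # event case ranked higher: concordant pair
            elif e == n:
                concordant += 0.5   # tied scores count as half
    return concordant / (len(events) * len(non_events))


# Made-up scores for five hypothetical cases, two of which had the event.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_outcomes = [1, 0, 1, 0, 0]
print(round(c_statistic(toy_scores, toy_outcomes), 3))  # 0.833
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation, which is why values of 0.79 to 0.94 are read as good discrimination.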
To assess clinical utility, 5 HF clinicians were asked about clinical relevance, justification, and interpretability of the results. Based on their input, a model predicting cluster and survival was developed, as well as an HF cluster app for routine clinical use.
Internal and external validations and subtype identification
- In the internal validation, the optimal number of clusters was 5. Based on demographics, cardiovascular disease (CVD) risk factor burden, atrial fibrillation (AF), CVD, medications, and laboratory factors, the 5 clusters were labelled as the following HF subtypes: (1) early onset, (2) late onset, (3) AF-related, (4) metabolic, and (5) cardiometabolic.
- In the external validation, subtypes were similar across datasets: the c-statistic ranged from 0.79 (subtype 3) to 0.94 (subtype 1) for the THIN model in CPRD, and from 0.79 (subtype 1) to 0.92 (subtypes 2 and 5) for the CPRD model in THIN.
- Distribution of the 5 subtypes was similar across the CPRD and THIN datasets, with late onset (~33%) and cardiometabolic (~29%) being the most common subtypes and AF-related (~9%) being the least common subtype.
- In the prognostic validation in CPRD using the THIN model, 1-year all-cause mortality after HF diagnosis was 0.20 (95%CI: 0.14–0.25) for subtype 1, 0.46 (95%CI: 0.43–0.49) for subtype 2, 0.61 (95%CI: 0.57–0.64) for subtype 3, 0.11 (95%CI: 0.07–0.16) for subtype 4, and 0.37 (95%CI: 0.32–0.41) for subtype 5. Between THIN and CPRD, mortality differed for subtypes 1 and 5 but not for the other subtypes.
- The risks of nonfatal CVD and all-cause hospitalization also varied by HF subtype.
- In the genetic validation, PRSs for atrial arrhythmias, diabetes mellitus (DM), hypertension, myocardial infarction (MI), obesity, stable angina, and unstable angina were all associated with ≥1 HF subtype after correction for multiple testing (P<0.0009). The late onset and cardiometabolic subtypes were broadly associated with similar PRSs.
- Eight SNPs were nominally associated with predicted HF subtypes (P=0.049), of which 4 SNPs were limited to the AF-related subtype.
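The summary does not say how the P<0.0009 threshold in the genetic validation was derived. One reading that reproduces it is a Bonferroni correction, in which the usual 0.05 level is divided by the number of tests. The sketch below assumes 11 PRSs each tested against the 5 subtypes (55 tests); the count of 11 and the choice of Bonferroni are assumptions for illustration, not figures stated in this summary.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise level at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


N_PRS = 11       # assumption: number of PRSs tested (not given in this summary)
N_SUBTYPES = 5   # from the summary: 5 HF subtypes

threshold = bonferroni_threshold(0.05, N_PRS * N_SUBTYPES)
print(f"{threshold:.5f}")  # 0.00091
```

Under these assumptions the per-test level is 0.05/55, which rounds to the 0.0009 quoted above.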
Clinical utility and effectiveness
- The 5 HF clinicians reported that the included factors and the identified clusters had clinical relevance, as per the authors’ 2021 framework.
- The clinicians also felt the developed app reflected the identified HF subtypes and could enable testing of effectiveness and cost-effectiveness in appropriately designed, prospective studies.
Using their 6-stage framework for machine learning implementation, the authors identified 5 HF subtypes (early onset, late onset, AF-related, metabolic, and cardiometabolic) and validated these subtypes based on population-representative data. The 5 subtypes showed good predictive accuracy for 1-year all-cause mortality. To assess effectiveness of their approach, the authors also developed an open-access HF cluster app that clinicians can use to identify the cluster that fits a particular patient and their predicted survival.