New screening tool uses machine learning to identify individuals with probable FH in large datasets
Precision screening for familial hypercholesterolaemia: a machine learning study applied to electronic health encounter data
Introduction and methods
In the United States, fewer than 10% of individuals with familial hypercholesterolemia (FH) are identified, which leaves most of them untreated despite their likely elevated LDL-c levels and risk of premature coronary artery disease [1-3]. US guidelines recommend screening to identify families with FH, but the best method for large-scale screening has yet to be established. A successful method probably includes both efficient cascade screening and effective index case identification [4-8].
The authors have previously reported on successful application of a machine learning model to identify undiagnosed individuals with FH, built from and applied to single healthcare institutions. This study aimed to build a model that can be applied at both the institutional and national healthcare database scales to identify new index cases. To that end, the FIND FH machine learning model was constructed. Model characteristics were defined using individuals with FH and individuals presumed not to have FH. Subsequently, it was tested whether the model can identify individuals with a medical profile consistent with FH in independent clinical settings.
Structured electronic health record (EHR) data from four large academic health systems were used to build and train the FIND FH machine learning model. For training of the model, a case was defined as an individual with a clinical diagnosis of FH by a lipid expert (939 individuals, 42% of whom were genetically confirmed), and a presumed control without FH as an individual with no previous diagnosis of FH by a lipid expert in their medical record (83,136 individuals).
For more details regarding development of the algorithm, we refer to the original article.
- The final FIND FH model includes several features: demographic data, conditional data that captures patient health response during therapy, prescription-based, diagnosis-based, procedure-based and laboratory results-based data.
- When looking at which features contributed most effectively to distinguishing between individuals with and without FH, and thus had the greatest effect on model performance, laboratory-based features were the most frequent contributors, followed by healthcare encounter-based features (i.e. prescription, diagnosis, and procedure).
- The FIND FH model had a precision (positive predictive value) of 0.85 and a recall (sensitivity) of 0.45, when tested on the holdout tuning dataset with prevalence 1:71.
- In the first external validation dataset, namely the national dataset, 1,331,759 out of 170,416,201 individuals were flagged as likely having FH. Of those, 45 were reviewed and, in line with the positive predictive value, 87% (95%CI: 73-100) of them were identified as having possible, probable or definite FH by at least one of the diagnostic criteria or by a physician.
- If a selection was made based on LDL-c >190 mg/dL, only 46% of the 39 new cases of FH would have been identified.
- In a second external validation set, 866 of 173,733 individuals were flagged, of whom 103 were reviewed. 77% (95%CI: 68-86) of those were identified as having possible, probable or definite FH. Applying an LDL-c threshold would have only identified 47% of the 79 new cases.
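The counts in the results above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original study: the class, field, and function names are hypothetical, and only the numbers are taken from the summary. Because the reported percentages are rounded, the derived head counts are rounded back to whole individuals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationSet:
    """Counts reported in the summary for one external validation dataset."""
    name: str
    screened: int     # individuals the model was applied to
    flagged: int      # individuals flagged as likely having FH
    reviewed: int     # flagged individuals whose records were reviewed
    ppv: float        # reported fraction of reviewed individuals with FH
    ldl_only: float   # reported fraction of new cases an LDL-c cutoff would find

    @property
    def flag_rate(self) -> float:
        """Fraction of the screened population that was flagged."""
        return self.flagged / self.screened

    @property
    def confirmed(self) -> int:
        """Reviewed individuals with possible, probable or definite FH."""
        return round(self.reviewed * self.ppv)

    @property
    def found_by_ldl_only(self) -> int:
        """Confirmed cases an LDL-c threshold alone would have identified."""
        return round(self.confirmed * self.ldl_only)


def per_100_true_cases(precision: float, recall: float) -> dict:
    """Expected outcome per 100 true FH cases at a given precision and recall."""
    true_positives = 100 * recall
    flagged = true_positives / precision
    return {
        "true_positives": true_positives,
        "false_positives": flagged - true_positives,
        "missed": 100 - true_positives,
    }


national = ValidationSet("national", 170_416_201, 1_331_759, 45, 0.87, 0.46)
second = ValidationSet("second external", 173_733, 866, 103, 0.77, 0.47)

for v in (national, second):
    print(
        f"{v.name}: {v.flag_rate:.2%} flagged, "
        f"{v.confirmed} of {v.reviewed} reviewed had FH, "
        f"{v.found_by_ldl_only} of those findable by LDL-c alone"
    )

# Holdout performance reported for the model (precision 0.85, recall 0.45).
print(per_100_true_cases(0.85, 0.45))
```

The derived counts of 39 and 79 confirmed cases match the numbers of new cases quoted in the bullets, and the flag rates show that the model singles out well under 1% of each population for review.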
FIND FH is a machine learning model that could identify phenotypic FH when applied to large medical datasets. In two distinct types of large medical datasets, it identified a large number of individuals with probable FH who had not previously been diagnosed. The model was built on longitudinal medical data from individuals with at least one documented CV disease risk factor in their history. It does not rely on specific information such as tendon xanthomas or family history. Importantly, FIND FH does not rely solely on lipid concentrations, which is an advantage because many of the patients identified did not have lipid levels in their EHR.
After reiterating the risks associated with FH and the potential of treatment when the disease is recognized early, Pereira notes that its dominant inheritance facilitates cascade screening. Nonetheless, most societies fall short in identifying new FH cases, and thus in targeting them with preventive strategies.
Pereira lists many reasons used to explain the low numbers of identified subjects with FH, but he states that none of these justify the inaction. ‘In an era of personalized medicine, FH is potentially one of the most tractable conditions to deliver the promised society-wide benefits of the implementation of this paradigm,’ he says. Controlling the disease requires a multipronged approach and the orchestrated participation of several parties. The article of Myers and colleagues forms ‘a contribution to the toolbox of those fighting under-diagnosis of the disease’.
Pereira does question whether flagging algorithms that are native to electronic health systems will induce real advances in combating the disease. FIND FH, in its current state, will only flag potential patients to physicians who choose to receive the information. For the system to translate into better control of diagnosed patients, further steps are also required. Additional tools are needed, which help administrators to monitor the deliverables of this type of system. Data on what is done with the information on identified cases, with regard to their lipid levels, contacted family members, and treatment (response), will yield information on new bottlenecks, which should be used to create a roadmap for implementation of the new system in such a way that it can curb the disease.
Pereira expresses the hope that the technology already available will be put to use, paving the way to truly transformative care of individuals with FH.