Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada; Harvey E. Beardmore Division of Pediatric Surgery, The Montreal Children's Hospital, McGill University Health Centre, Montreal, Quebec, Canada
•
Clinical prediction tools (CPTs) are decision-making instruments utilizing patient data to predict specific outcomes; they were first developed using statistical models but are now increasingly being supplemented by machine learning (ML).
•
This systematic review investigated the clinical validity and applicability of ML-based CPTs compared to statistical CPTs in pediatric surgery.
Summary
Purpose
Clinical prediction tools (CPTs) are decision-making instruments utilizing patient data to predict specific clinical outcomes, risk-stratify patients, or suggest personalized diagnostic or therapeutic options. Recent advancements in artificial intelligence have resulted in a proliferation of CPTs created using machine learning (ML), yet the clinical applicability of ML-based CPTs and their validation in clinical settings remain unclear. This systematic review aims to compare the validity and clinical efficacy of ML-based CPTs to traditional statistical CPTs in pediatric surgery.
Methods
Nine databases were searched from 2000 until July 9, 2021 to retrieve articles reporting on CPTs and ML for pediatric surgical conditions. PRISMA standards were followed, and screening was performed by two independent reviewers in Rayyan, with a third reviewer resolving conflicts. Risk of bias was assessed using the PROBAST.
Results
Out of 8,300 studies screened, 48 met the inclusion criteria. The most represented surgical specialties were pediatric general surgery (14), neurosurgery (13) and cardiac surgery (12). Prognostic CPTs (26) were the most common type, followed by diagnostic (10), interventional (9), and risk-stratifying (2); one study included a CPT for diagnostic, interventional and prognostic purposes. Although 81% of studies compared their CPT to other ML-based CPTs, statistical CPTs, or the unaided clinician, most lacked external validation and/or evidence of clinical implementation.
Conclusions
While most studies claim significant potential improvements from incorporating ML-based CPTs in pediatric surgical decision-making, both external validation and clinical application remain limited. Further studies must focus on validating existing instruments or developing validated tools, and on incorporating them into the clinical workflow.
The authors of this manuscript have no conflicts of interest to disclose.
Previous communication
This manuscript is based on the abstract that was accepted for the 2022 Canadian Association of Pediatric Surgeons (CAPS) Annual Meeting.
Financial support statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
1. Introduction
Clinical prediction tools (CPTs) have become ubiquitous in the diagnosis and treatment of many conditions. CPTs are defined as clinical decision-making instruments which utilize different aspects of the patient's clinical history, physical examination and various biologic and imaging test results to predict a specific clinical outcome, risk-stratify patients, or suggest a personalized diagnostic or therapeutic course of action [1]. Historically, CPTs were first developed using statistical models, often regression models (e.g. logistic regression or recursive partitioning), in order to create a clinical decision-making framework [2]. Their overall aim was to standardize patient treatments while simultaneously improving outcomes.
Recently, multiple CPTs have been created using machine learning (ML, a subset of artificial intelligence), rather than traditional statistical methods, due to advancements in electronic information storage and sharing [3]. Traditional statistical CPTs often rely on linear regression models and subjective human input to identify variables of interest correlated with a specific outcome. On the other hand, ML focuses on computer algorithms that “learn” from input data to predict a specific outcome [4]. ML-based CPTs will extrapolate, from a limited set of inputs regarding a patient encounter, a possible diagnosis or risk assessment to guide future clinical decision-making. Therefore, they leverage the ability of ML to deal with large and complex datasets such as the increasingly available electronic health records data to extrapolate simple decision-making frameworks.
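The contrast between the two approaches can be illustrated with a toy comparison, assuming synthetic data and scikit-learn (none of this is taken from any study in the review): a logistic regression learns a weighted linear score, while a random forest can pick up nonlinear interactions among the inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic "patient encounters": 500 rows, 6 input variables
# (stand-ins for labs, vitals, demographics), binary outcome with
# a deliberately nonlinear interaction term.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

stat_cpt = LogisticRegression().fit(X_tr, y_tr)                  # "statistical" CPT
ml_cpt = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)  # ML-based CPT

print(f"logistic regression accuracy: {stat_cpt.score(X_te, y_te):.2f}")
print(f"random forest accuracy:       {ml_cpt.score(X_te, y_te):.2f}")
```

On data with interaction effects the forest typically scores higher, mirroring the pattern most included studies report; on purely linear signal the two would be comparable.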
While most ML-based CPTs have compared favourably to traditional tools in silico (i.e. outside clinical settings), there is limited evidence of superiority of such tools in actual clinical workflows [5]. For example, Marcinkevics et al. found that ML-based CPT algorithms achieved better diagnostic performance than either the Alvarado or the Pediatric Appendicitis Scores in a pediatric population with suspected appendicitis. However, the clinical applicability of these ML-based CPTs and their validation in various clinical settings remain to be determined [6].
Whilst many ML-based CPTs have recently been devised for a variety of different surgical pediatric conditions, it remains unclear if they were externally validated or implemented in the clinical setting. Therefore, this systematic review aims to examine the clinical efficacy and validity of ML-based CPTs as compared to statistical CPTs currently in use in the pediatric population. Our results are expected to enhance the development, validation and integration of ML-based CPTs in pediatric surgery, and shed light on the current gaps that exist in their clinical implementation and validation.
2. Methods
We conducted a systematic review of the published literature interrogating the use of CPTs and ML in pediatric surgery. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and checklist for conducting systematic reviews were followed (see Supplementary Material). The review was registered with the National Institute for Health Research's PROSPERO website (CRD42021268036) and Open Science Framework (https://doi.org/10.17605/OSF.IO/J8M9D). A senior medical librarian (EG) searched the following databases from 2000 until July 9, 2021: Medline (Ovid), Embase (Ovid), Cochrane (Wiley), Global Health (Ovid), Web of Science (Clarivate Analytics), ProQuest Central, Inspec – Engineering Village (Elsevier), Africa Wide Information (Ebsco) and Global Index Medicus (WHO), with no language restrictions. The search strategy used variations in text words found in the title, abstract or keyword fields, and relevant subject headings, to retrieve articles looking at CPTs and ML in the pediatric setting. Animal studies were excluded (see Supplementary Material for the full search strategy and PRISMA-S extension for searching).
References found were imported into EndNote X9, where duplicates were removed. Records were then imported into the online platform Rayyan and screened by two independent reviewers (AB & ZA), with a third reviewer (DP/EG) resolving conflicts. Inter-rater reliability was measured using the first 50 articles, aiming for a kappa score above 80% prior to commencing the two-step screening process. Articles were first screened by title and abstract, followed by full-text review of included articles. The primary reason for exclusion was documented in a Google Sheet.
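The kappa statistic used here for inter-rater reliability can be computed directly from the two reviewers' include/exclude decisions. A minimal stdlib-only sketch, with hypothetical screening decisions rather than the reviewers' actual data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical include(1)/exclude(0) decisions for 10 abstracts.
reviewer_1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
reviewer_2 = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # → 0.78
```

A value of 0.78 here would fall just below the 80% target the authors set before proceeding with two-reviewer screening.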
2.1 Inclusion and Exclusion Criteria
Studies were included if (1) the participant group was composed of children (birth to 18 years of age), (2) data from the pediatric population were analyzed separately from those of the adult population when both were assessed, (3) data from the surgical population were analyzed separately from those of the non-surgical population when both were assessed, and (4) the study compared the validity and clinical efficacy of CPTs in pediatric surgery.
Studies were excluded if (1) the participant group was solely composed of preterm neonates or adults (over 18 years of age), (2) the study involved animals, (3) the condition studied was not surgical, (4) the article was a conference abstract or paper, a clinical trial, or a methodological study based on an ML-based CPT, (5) the full text could not be obtained, (6) patient demographics were absent and/or the CPT-related methodology was incomplete, or (7) the study used only images, signals, gene expression profiles or other genetic data. Literature reviews (narrative, scoping or systematic) were also excluded; however, individual studies included in such reviews were kept if they met the inclusion criteria.
2.2 Data Extraction and Analysis
The following data were extracted from all studies: country of origin, study type, surgical specialty, condition studied, demographics of study population (e.g. sample size, average age and female-to-male ratio), CPT type (e.g. diagnostic, prognostic, interventional, or risk stratifying), as well as ML model studied and comparator studied (e.g. no comparator, ML-based CPT, traditional statistical CPT or unaided clinician).
To analyze how the CPTs were trained and validated, the source and country of origin of the datasets, average number of input variables studied, validation method used, train-test split ratio (if applicable), model performance measures used, and best overall models were extracted from all studies, along with the aforementioned demographics of the study population. Data were extracted for both internal and external validation, where the latter was performed. Future plans or next steps for each CPT were identified to assess their clinical applicability.
2.3 Quality Assessment
The risk of bias and applicability of the included studies were assessed by two independent reviewers (AB & ZA) using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). The PROBAST includes 20 signaling questions across 4 domains: participants, predictors (i.e. input variables), outcome, and analysis.
3. Results
The initial search yielded 9,291 studies, of which 8,300 remained for title and abstract screening after duplicate removal. Of the 139 studies included for full-text review, 48 were retained for final data extraction (Fig. 1 and Table 1). These comprised 41 retrospective studies, 6 prospective studies, and one study that was both retrospective and prospective. Table 1 highlights general study characteristics and the surgical specialties in which ML-based CPTs were used in pediatric surgical decision-making. Information regarding the specific ML models used in each study is summarized in Table S1. Most studies were performed in high-income countries, with the United States most represented (28/48). The application of ML-based CPTs spanned many different surgical specialties; the most represented were pediatric general surgery (14), neurosurgery (13) and cardiac surgery (12) (Table 1). The most utilized ML methods were random forest (54.2%), followed by decision trees (37.5%) and support vector machines (25%) (Fig. 2 (A) and Table S1). In 31 of 48 studies, the CPTs used fewer than 50 input variables per model (Fig. 3).
Fig. 1PRISMA Flow Diagram. From: Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372:n71. https://doi.org/10.1136/bmj.n71. For more information, visit: http://www.prisma-statement.org/.
Fig. 2(A) Most common ML models, (B) model comparator, (C) best performing models when ML and LR (or a variation of LR) are compared, (D) internal validation methods, (E) external validation status, (F) future directions score, all from the included studies. Refer to Table S1 for more details.
Thirty-nine of the included studies compared ML-based CPTs with either other ML models or logistic regression (representative of statistical CPTs); nine out of 48 (18.8%) studies did not include a comparator (Fig. 2 (B) and Table S1). Internal validation of the ML model was a prerequisite for study inclusion; studies utilized a train-test split method (23), k-fold cross-validation (11), or both (4), among others (Fig. 2 (D) and Table S1). However, measures of performance varied widely among studies: Table 2 outlines the most common performance measures used. External validation was performed in only 2 studies (Fig. 2 (E) and Table S1). A score from 0 to 2 was assigned to each study summarizing future development plans for its ML-based CPT, where 0 denotes studies that were only proofs-of-concept and 2 denotes studies in which external validation was performed and clinical integration is the next step (Fig. 2 (F) and Table S1). Ten (20.8%) studies had a score of 0; 22 (45.8%) stated that validation with a larger dataset was the next step without specifying external validation, and were assigned a score of 0.5. Thirteen (27.1%) studies were assigned a score of 1, external validation being their next step, and the only studies with external validation (3, 6.3%) were assigned a score of 1.5, as they were not yet ready for clinical integration. As seen in Table S1, the majority of studies lacking external validation had future plans of applying the ML model to a separate population, yet such follow-up studies were rarely identified in our search.
Table 2. Most common performance measures used in the included studies (N = 46); some studies used more than one performance measure.
Performance Measure: N (%)
AUC/AUROC: 38 (82.6%)
Sensitivity/Recall: 29 (63.0%)
Specificity: 24 (52.2%)
PPV/Precision: 18 (39.1%)
Accuracy: 14 (30.4%)
NPV: 13 (28.3%)
F1-Score/F-Score/F-Measure: 9 (19.6%)
AUPR: 3 (6.5%)
Legend: AUROC, area under the receiver operating characteristic curve; PPV, positive predictive value; NPV, negative predictive value; AUPR, area under the precision-recall curve.
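Apart from the AUC family, the measures in Table 2 all derive from the same 2x2 confusion matrix. A stdlib-only sketch with illustrative counts (not data from the review):

```python
def performance_measures(tp, fp, fn, tn):
    """Common CPT performance measures from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)              # a.k.a. recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                      # a.k.a. precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean
    return dict(sensitivity=sensitivity, specificity=specificity,
                ppv=ppv, npv=npv, accuracy=accuracy, f1=f1)

# Hypothetical validation results: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
m = performance_measures(tp=80, fp=10, fn=20, tn=90)
print({k: round(v, 2) for k, v in m.items()})
```

Because these measures trade off against one another (e.g. PPV depends on outcome prevalence while sensitivity does not), reporting only one or two of them, as several included studies did, can overstate a model's usefulness.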
The majority of studies in which ML-based CPTs were compared to statistical CPTs reported higher performance measures with the ML-based approach (Fig. 2 (B) and Table S1). Out of 22 studies comparing ML-based CPTs to statistical CPTs (logistic regression or variants of it), statistical CPTs outperformed ML in only 2 studies, and two other studies found comparable performance between the two approaches. One study, by Bertoni et al., found that both ML- and statistics-based CPTs performed poorly at identifying children needing postoperative overnight monitoring. Overall, ML-based CPTs showed higher discriminative power and greater accuracy in most comparisons with statistical approaches (Fig. 2 (C) and Table S1). In some studies where different ML models performed similarly, a slightly less accurate model was chosen for its ease of implementation and integration into clinical workflows.
ML-based CPTs were divided into diagnostic, interventional, risk-stratifying, or prognostic depending on the main outcome(s) of interest (Table 1). Diagnostic CPTs (10) were used for identifying surgical candidates or surgery-requiring conditions, interventional CPTs (9) evaluated the need for a surgical intervention or interventions relating to surgery, and risk-stratifying CPTs (2) evaluated surgery-related risks. The majority of studies were prognostic CPTs (26), focused on predicting outcomes after surgery or factors associated with the post-operative period. Lastly, one study by Marcinkevics et al. developed an ML-based CPT for diagnostic, interventional and prognostic purposes.
In order to answer the question of validity and applicability, scores were assigned to each study highlighting future directions for that specific CPT. Generally, ML-based CPTs lacked external validation, i.e. the scores assigned were ≤1. It is worth noting that in some studies, external validation was claimed on the basis of "holding out" or "splitting" part of the population for testing during the validation phase. We have considered this to be internal, rather than external, validation, as the latter was defined as the implementation of the CPT in a population completely separate from the starting population.
Risk of bias analysis showed that almost all of the included studies were applicable to the review question, but 90% of them had a high risk of bias, mainly in the analysis domain (Fig. 4 and Table S2). This is because they lacked appropriate internal validation techniques (e.g. a train-test split only, instead of cross-validation or bootstrapping), their performance measures did not evaluate model calibration or discrimination, missing data were not handled appropriately, and/or the number of participants with the outcome was insufficient relative to the number of input variables studied. Risk of bias was unclear in the predictor and outcome domains for 71% of the included studies, primarily because authors did not clearly state whether predictors were assessed without knowledge of outcome data and vice versa, bearing in mind that most of the studies were retrospective. Furthermore, a pre-planned meta-analysis of the overall performance advantage of ML-based CPTs had to be abandoned because of the unacceptably high risk of bias and heterogeneity of the included articles. It is worth noting that the PROBAST was designed to evaluate studies in which the prediction models are either diagnostic only or prognostic only; however, all signaling questions could be answered for the included studies in which the CPT type differed.
Fig. 4Assessment of risk of bias and applicability of the included studies using the PROBAST.
4. Discussion
CPTs use patient data or information regarding the patient encounter to arrive at a simplified decision-making framework, and have had a number of uses within the pediatric and adult populations across different fields. Traditionally, these tools were created using a mix of linear statistics, subjective clinical expertise, and patient data. Clinicians interested in a particular condition would review retrospective or prospective cohorts of patients, then analyze the results using simple linear statistics to highlight predictive correlations between specific variables and the desired outcome(s). However, due to the subjective nature of their design, traditional statistics-based CPTs have had a number of shortfalls, including performance (missed or incorrect diagnoses) and reproducibility. Moreover, mutually contradictory versions of the same CPT could be devised depending on the patient population chosen, the variables included for analysis, and the type of analyses performed. For instance, the earliest validated published CPT for the diagnosis of appendicitis was the Alvarado Score; over the years, a number of similar or modified scores have been proposed to address some of its shortcomings. Therefore, while most CPTs have proven clinical utility, the very presence of multiple alternative CPTs for any clinical question points to the need for improved tools.
In order to remove (or at least decrease) the subjectivity inherent in the design and development of CPTs, ML-based approaches have more recently been favoured. ML is a subset of artificial intelligence and encompasses a number of different methods of data manipulation. The basic principle of ML is to use a computer algorithm that "learns" from the data, given a specific set of inputs and outputs. The inputs, such as patient laboratory values, imaging, or demographic information, are selected widely in order to limit any subjectivity in variable choice. The typical workflow for developing an ML-based CPT includes: (1) determining the outcome to be predicted for a given population, (2) gathering input-output data from a sample of the population (termed the "internal dataset"), (3) preprocessing ("cleaning") the internal dataset, and (4) testing and validating the resulting CPT using the internal dataset to evaluate its performance with respect to the desired outcome. The optimization of the internal dataset for testing and validation is arguably the most important step in this workflow, ensuring that the resulting CPT is as accurate and precise as possible. However, in order to ensure the reliability, robustness and generalizability of these CPTs and to detect any inherent biases in the internal dataset, it is imperative that, prior to clinical integration, they are also validated using an external dataset, i.e. a dataset from a population different from the one sampled for the internal dataset.
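The four steps of this workflow can be traced end-to-end with a deliberately simple stand-in "model" (a single-variable threshold rule fitted to synthetic data; the variable, cutoff, and noise level are invented for illustration):

```python
import random

rng = random.Random(0)

# (1) Outcome to predict: binary "needs surgery".
# (2) Internal dataset: (white_cell_count, outcome) pairs, synthetic,
#     with a true cutoff near 11 plus 10% label noise.
xs = [rng.uniform(4, 20) for _ in range(200)]
internal = [(x, int(x > 11) ^ (rng.random() < 0.1)) for x in xs]

# (3) Preprocessing: drop physiologically implausible values.
internal = [(wcc, y) for wcc, y in internal if 0 < wcc < 50]

# (4) Test/validate: hold out 30% of the internal dataset,
#     fit the threshold on the training part only.
rng.shuffle(internal)
cut = int(0.7 * len(internal))
train, test = internal[:cut], internal[cut:]

def fit_threshold(data):
    """'Learn' the cutoff that maximizes training accuracy."""
    candidates = sorted({wcc for wcc, _ in data})
    def acc(t):
        return sum((wcc > t) == bool(y) for wcc, y in data) / len(data)
    return max(candidates, key=acc)

t = fit_threshold(train)
test_acc = sum((wcc > t) == bool(y) for wcc, y in test) / len(test)
print(f"learned threshold {t:.1f}, held-out accuracy {test_acc:.2f}")
```

Note that the held-out test set here is still drawn from the same internal dataset: by the review's definition this is internal validation, and external validation would require repeating step (4) on data from an entirely separate population.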
In this systematic review, we sought to determine the clinical validity and applicability of ML-based CPTs as compared to traditional statistical CPTs in pediatric surgery. To this end, we identified 48 articles that met our inclusion criteria after full-text screening. In addition to lacking external validation, ML-based CPTs were very heterogeneous in design: the sample size used for each CPT and the number of features included as input variables varied drastically across studies, and only a handful of studies justified their choice of the number of included input variables. The use of ML, however, allowed investigators to include large numbers of input variables in their datasets, well outside the capabilities of traditional statistics.
Most studies were retrospective in nature, reflecting the relatively easy access to electronic health records data, typically without a need for individual consent. However, in keeping with a general lack of uniformity, the studies reviewed used different performance measures for their chosen CPT model. Common measures included the area under the receiver operating characteristic curve, sensitivity and specificity, the measures most relevant to clinicians who might utilize the CPT (Table 2). It appears, however, that having the best performance measure is not an absolute requirement, as some ML models are more useful than others depending on the context in which they are used. Models such as random forest, support vector machines, and artificial neural networks are known for their high predictive performance on nonlinear problems and their ability to find complex interactions among input variables, while other models might be chosen for their simplicity, which can improve model understanding and interpretation.
Similarly, different groups chose different internal validation methods, most commonly train-test splitting and k-fold cross-validation. In train-test splitting, a fixed percentage of the data is used for training and the remainder for testing. This method was by far the most utilized in the included studies, but the ratio between training and testing samples varied widely, often without justification. In k-fold cross-validation, the data are partitioned into k subsets (folds); the model is trained on k-1 folds and tested on the remaining fold, rotating so that each fold serves once as the test set. Lastly, in studies lacking a comparator to the ML model, it was impossible to determine whether the CPT had any clinical validity or was rather a simple proof-of-concept of the predictive ability of ML-based CPTs. It is clear throughout our review that there is a need for standardization and transparency in how ML-based CPTs are developed and tested, in order to ensure their safety and applicability in clinical settings. Tools like the PROBAST can be used to guide standardization efforts.
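The two internal validation schemes can be sketched at the level of index bookkeeping, without any ML library (the split ratio and fold count below are chosen for illustration only):

```python
import random

def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Single random split: one training set, one held-out test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_fraction))
    return idx[:cut], idx[cut:]

def k_fold_indices(n, k=5):
    """k-fold CV: each of the k folds serves once as the test set,
    with the remaining k-1 folds used for training."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        yield train, test

train, test = train_test_split_indices(100)
print(len(train), len(test))           # 70 30
splits = list(k_fold_indices(100, k=5))
print(len(splits), len(splits[0][1]))  # 5 folds, 20 test indices each
```

A single split evaluates the model once on 30% of the data, whereas 5-fold cross-validation evaluates it five times on disjoint 20% slices, giving a less ratio-sensitive performance estimate; this is one reason the PROBAST treats a lone train-test split as a weaker validation technique.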
This review has several limitations. The distinction between ML-based and traditional statistical CPTs is not clear-cut: during literature search and screening, many studies defined variants of logistic regression as ML, but to arrive at a reasonable number of studies, all logistic regression variants were considered non-ML. Due to the heterogeneity of studies and their design, it was often difficult to obtain information regarding model development and validation, which led to some studies being excluded. Articles that might have met the inclusion criteria, but for which the full text was unobtainable, were also not included. The study concerns the pediatric population; however, our definition of pediatric included only term patients, excluding all premature populations. A limitation of the PROBAST is that some of its signaling questions require clinical or statistical expertise.
5. Conclusions
Our review confirms that, despite the enthusiasm for ML-based CPTs in the literature, their external validation remains elusive and their actual clinical implementation is rare. While we do not question the potential utility of ML-based CPTs, much work remains to be done to validate these methods in a standardized fashion and to transition them into the clinical environment. This is a missed opportunity, since ML-based CPTs leverage the advantages of artificial intelligence in handling and analyzing large, complex datasets. In an age where electronic health records have eased data gathering, integration of patient profiles, and information sharing, ML-based CPTs could be the next frontier for establishing efficient decision-making frameworks that improve patient outcomes in pediatric surgery. Regardless of the surgical specialty, they have the potential to identify surgical candidates or conditions, evaluate the need for surgery and surgery-related risks, and predict postoperative outcomes, all of which are prime targets for expanded research in the pediatric setting.
The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: explanation and elaboration.
Using machine learning analysis to assist in differentiating between necrotizing enterocolitis and spontaneous intestinal perforation: A novel predictive analytic tool.
The Modified Heidelberg and the AI Appendicitis Score Are Superior to Current Scores in Predicting Appendicitis in Children: A Two-Center Cohort Study.
The additive impact of the distal ureteral diameter ratio in predicting early breakthrough urinary tract infections in children with vesicoureteral reflux.
Predictive Analytics and Modeling Employing Machine Learning Technology: The Next Step in Data Sharing, Analysis, and Individualized Counseling Explored With a Large, Prospective Prenatal Hydronephrosis Database.
Predicting ideal outcome after pediatric liver transplantation: An exploratory study using machine learning analyses to leverage Studies of Pediatric Liver Transplantation Data.
Decision-making in pediatric blunt solid organ injury: A deep learning approach to predict massive transfusion, need for operative management, and mortality risk.
Enhanced neonatal surgical site infection prediction model utilizing statistically and clinically significant variables in combination with a machine learning algorithm.
Machine Learning Applied to Registry Data: Development of a Patient-Specific Prediction Model for Blood Transfusion Requirements During Craniofacial Surgery Using the Pediatric Craniofacial Perioperative Registry Dataset.