Use at your own risk

Risk Assessment of AI/ML Models in Healthcare

As artificial intelligence and machine learning (AI/ML) algorithms are increasingly deployed, we must ask ourselves: what are these models made of? Many neural networks operate as a black box: we give a model training data and an accuracy goal and BAM! We, the public, ‘hope’ that the model is ‘correct’. A model may be mathematically accurate and precise with respect to its training sets, yet it rarely accounts for algorithmic bias [1]–[3]. Model performance is often assessed using data that is easily available, rather than data that reflects the target population in which the model will actually be used. As AI/ML models are increasingly used in clinical settings, it is imperative to know how and when to incorporate model output into clinical decisions, and how and when not to. Using current models as they exist now, however, may pose problems.


While machine learning engineers may be aware of these issues, health professionals may use a model uncritically, assuming that the computer scientists who created the algorithm also accounted for the unique risks posed by each patient. Clinicians may assume that the model minimizes risk by analyzing health records as a doctor might, which is not the case. Models tend to see patient health and patient cost as synonymous [4]. Clinicians must be aware of key risk indicators (KRIs) that help estimate whether it is safe and secure to deploy a given ML model in a specified environment [5]. These KRIs could address the robustness of a machine learning model to random input corruptions, distributional shifts caused by a changing environment, and adversarial perturbations. Clinicians and health team stakeholders are often unaware of the potential harm to patients that arises from clinical AI/ML tools that do not easily allow clinicians to work out their patients’ risk assessment. At the same time, machine learning offers opportunities to improve accuracy by exploiting complex interactions between risk factors, and it can already streamline hospital professionals’ workflows.
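To make the idea of a KRI for random input corruptions concrete, here is a minimal sketch in Python. It is not the method of [5]: the risk model, its coefficients, the patient records, the noise level, and the decision threshold are all hypothetical and chosen only for illustration. The indicator it computes is the share of treat/no-treat decisions that change when one input (systolic blood pressure) is perturbed by random measurement noise.

```python
import math
import random


def toy_risk_model(age, systolic_bp):
    """Hypothetical logistic risk model; the coefficients are made up."""
    z = -9.0 + 0.06 * age + 0.02 * systolic_bp
    return 1.0 / (1.0 + math.exp(-z))


def decision_flip_rate(model, patients, noise_sd, threshold=0.075,
                       trials=200, seed=0):
    """Share of noisy re-evaluations whose decision differs from the clean one.

    Each patient is an (age, systolic_bp) pair. Gaussian noise with standard
    deviation noise_sd is added to systolic_bp only (an assumption made to
    keep the sketch small).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    flips = 0
    total = 0
    for age, sbp in patients:
        clean_decision = model(age, sbp) >= threshold
        for _ in range(trials):
            noisy_sbp = sbp + rng.gauss(0.0, noise_sd)
            noisy_decision = model(age, noisy_sbp) >= threshold
            if noisy_decision != clean_decision:
                flips += 1
            total += 1
    return flips / total


# Hypothetical patients: (age in years, systolic blood pressure in mmHg).
PATIENTS = [(45, 120), (60, 140), (70, 160), (55, 131)]

if __name__ == "__main__":
    for sd in (0.0, 5.0, 10.0):
        rate = decision_flip_rate(toy_risk_model, PATIENTS, noise_sd=sd)
        print(f"noise sd = {sd:4.1f} mmHg -> decision flip rate = {rate:.3f}")
```

A flip rate that climbs quickly under realistic measurement noise suggests that patients near the threshold are receiving decisions the model cannot make reliably, which is the kind of signal a clinician-facing KRI might surface.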


Take, for example, assessing a patient’s 10-year risk of cardiovascular disease [6]. In 2017, researchers analyzed 378,256 patients from UK family practices who were free from cardiovascular disease at the outset. The participants were between 30 and 84 years old at the start of the study and had complete data for eight core baseline variables: sex, age, smoking status, systolic blood pressure, blood pressure treatment, total cholesterol, HDL cholesterol, and diabetes. Four machine-learning algorithms (random forest, logistic regression, gradient boosting machines, neural networks) were compared to an established algorithm (American College of Cardiology guidelines) in predicting the first cardiovascular event over 10 years. Predictive accuracy was assessed by the area under the receiver operating characteristic curve (AUC), and by sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) at a predicted cardiovascular risk of 7.5% (the threshold for initiating statins). There were 24,970 recorded cardiovascular events (6.6%). The study found that machine learning significantly improved the accuracy of cardiovascular risk prediction, increasing the number of patients identified who could benefit from preventive treatment while avoiding unnecessary treatment of others. However, such a model also oversimplifies complex relationships within health data. Imagine a patient with asthma being treated inappropriately on the basis of this model because asthma was not among the factors it assessed.
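For readers less familiar with the evaluation measures named above, the following Python sketch shows how they can be computed from predicted risks and observed outcomes. The ten patients are invented toy data, not data from the study in [6], and the function names are hypothetical. The only value taken from the text is the 7.5% risk threshold.

```python
def confusion_counts(risks, events, threshold=0.075):
    """Count TP, FP, TN, FN when a risk >= threshold is flagged as high risk."""
    tp = fp = tn = fn = 0
    for risk, event in zip(risks, events):
        flagged = risk >= threshold
        if flagged and event:
            tp += 1
        elif flagged and not event:
            fp += 1
        elif not flagged and event:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def threshold_metrics(risks, events, threshold=0.075):
    """Sensitivity, specificity, PPV, and NPV at a fixed risk threshold."""
    tp, fp, tn, fn = confusion_counts(risks, events, threshold)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # events correctly flagged
        "specificity": _ratio(tn, tn + fp),  # non-events correctly not flagged
        "ppv": _ratio(tp, tp + fp),          # flagged patients who had an event
        "npv": _ratio(tn, tn + fn),          # unflagged patients without an event
    }


def auc(risks, events):
    """AUC as the chance that a patient with an event outranks one without.

    Ties count as half. This pairwise form is quadratic in the number of
    patients, which is fine for a toy example.
    """
    pos = [r for r, e in zip(risks, events) if e]
    neg = [r for r, e in zip(risks, events) if not e]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


# Invented toy data: predicted 10-year risk and whether an event occurred.
RISKS = [0.02, 0.05, 0.08, 0.10, 0.20, 0.03, 0.30, 0.06, 0.04, 0.01]
EVENTS = [0, 0, 0, 1, 1, 0, 1, 1, 0, 0]

if __name__ == "__main__":
    for name, value in threshold_metrics(RISKS, EVENTS).items():
        print(f"{name}: {value:.3f}")
    print(f"auc: {auc(RISKS, EVENTS):.3f}")
```

Note that AUC summarizes ranking quality across all thresholds, while the other four measures describe what happens at the one threshold where a treatment decision is actually made. A model can score well on the first and still misclassify the patients who sit close to 7.5%.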


In 2015, the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement was released to improve the reporting of prediction models in the published literature. It sets out recommendations for reporting studies that develop, validate, or update a prediction model, whether for diagnostic or prognostic purposes, in a checklist of 22 items deemed essential for transparent reporting of a prediction model study [7]. To make risk communication more accessible, Sendak et al. have pushed to create “nutrition” labels for medical algorithms [8]. The “Model Facts” label is an interdisciplinary effort by developers, clinicians, and regulatory experts, designed for clinicians who make decisions supported by a machine learning model. It follows the notion of risk communication, defined by the US FDA as “the term of art used for situations when people need good information to make sound choices,” and likens the model’s contents to a food label. Model Facts is a one-page document intended to present key information to the end user, including when it is not immediately clear that a model was involved. By understanding the objectives of the model designed by ML scientists and engineers, clinicians and other end users will be able to make better judgments for their patients.


Risk assessment in healthcare is a continuous process [9]. Meeting the challenge of integrating machine learning into healthcare rests on deep, consistent collaboration between clinicians, machine learning engineers, designers, and patients. Making our models transparent to end users who are not specialized in AI/ML is imperative for developing risk assessment guidance for our most vulnerable populations.


[1] C. O’Neil, Weapons of Math Destruction. 2016.

[2] S. Wachter, B. Mittelstadt, and C. Russell, “Why Fairness Cannot Be Automated: Bridging the Gap Between EU Non-Discrimination Law and AI,” May 2020, doi: 10.2139/ssrn.3547922.

[3] T. Panch, H. Mattie, and R. Atun, “Artificial intelligence and algorithmic bias: implications for health systems,” J. Glob. Health, vol. 9, no. 2, p. 020318, doi: 10.7189/jogh.09.020318.

[4] T. Simonite, “When It Comes to Health Care, AI Has a Long Way to Go,” Wired. Accessed: Feb. 21, 2022. [Online]. Available:

[5] P. Schwerdtner et al., “Risk Assessment for Machine Learning Models,” arXiv:2011.04328 [cs], Nov. 2020, Accessed: Dec. 07, 2021. [Online]. Available:

[6] S. F. Weng, J. Reps, J. Kai, J. M. Garibaldi, and N. Qureshi, “Can machine-learning improve cardiovascular risk prediction using routine clinical data?,” PLOS ONE, vol. 12, no. 4, p. e0174944, Apr. 2017, doi: 10.1371/journal.pone.0174944.

[7] G. S. Collins, J. B. Reitsma, D. G. Altman, and K. G. Moons, “Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement,” BMC Med., vol. 13, no. 1, p. 1, Jan. 2015, doi: 10.1186/s12916-014-0241-z.

[8] M. P. Sendak, M. Gao, N. Brajer, and S. Balu, “Presenting machine learning model information to clinical end users with model facts labels,” Npj Digit. Med., vol. 3, no. 1, pp. 1–4, Mar. 2020, doi: 10.1038/s41746-020-0253-3.

[9] G. K. Kaya, J. R. Ward, and P. J. Clarkson, “A framework to support risk assessment in hospitals,” Int. J. Qual. Health Care, vol. 31, no. 5, pp. 393–401, Jun. 2019, doi: 10.1093/intqhc/mzy194.
