Artificial intelligence (AI), defined as the engineering of computerized systems able to perform tasks that normally require human intelligence, has substantial potential in medical imaging. Machine learning (ML) and deep learning algorithms have been developed to improve radiology workflows and to assist radiologists by automating tasks such as lesion detection and image quantification. In some cases, these algorithms have matched or even exceeded the performance of radiologists.
To create ML models, for instance for lung cancer or COVID-19 detection, AI developers need data to train them. Such data is commonly split roughly 80% for training and 20% for validation, and both sets are often balanced at about 50% normal studies and 50% studies showing the condition under investigation. However, before medical images can be used to develop an AI algorithm, several steps must be taken. Typically, approval from the local ethics committee is required before medical data may be used to develop a research or commercial AI algorithm. After ethical approval, the relevant data needs to be accessed, queried, properly de-identified, and securely stored. Any protected health information must be removed from both the DICOM metadata and the images themselves. The next step is to structure the data in a homogenized, machine-readable format. The last step is to link the images to ground-truth information, which can be one or more labels, segmentations, or an electronic phenotype (e.g., biopsy or laboratory results). And that's where the problems begin.
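Two of the steps above, stripping protected health information from metadata and producing a balanced train/validation split, can be sketched in a few lines. This is a minimal illustration on plain Python dicts standing in for DICOM headers; the tag names are a small illustrative subset (real de-identification follows the DICOM confidentiality profiles and is usually done with a library such as pydicom), and the helper names are our own.

```python
import random

# Illustrative subset of DICOM tags that commonly carry protected
# health information; a real pipeline removes many more.
PHI_TAGS = {"PatientName", "PatientID", "PatientBirthDate",
            "InstitutionName", "ReferringPhysicianName"}

def deidentify(header):
    """Drop PHI tags from a DICOM-like header dict."""
    return {tag: value for tag, value in header.items() if tag not in PHI_TAGS}

def stratified_split(records, val_fraction=0.2, seed=0):
    """Split records into train/validation sets, preserving label balance."""
    rng = random.Random(seed)
    by_label = {}
    for rec in records:
        by_label.setdefault(rec["label"], []).append(rec)
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(len(group) * val_fraction)
        val.extend(group[:cut])
        train.extend(group[cut:])
    return train, val

# Toy dataset: 50 normal studies and 50 positive studies.
records = [{"StudyID": i, "PatientName": f"name-{i}",
            "label": "positive" if i % 2 else "normal"} for i in range(100)]
records = [deidentify(r) for r in records]

train, val = stratified_split(records, val_fraction=0.2)
print(len(train), len(val))  # 80 20
```

Because the split is done per label group, both the training and validation sets keep the same 50/50 balance as the source data.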
Developers of AI algorithms are typically not located within a hospital and therefore often do not have direct access to medical imaging data through the PACS, especially when they are developing commercial algorithms. Access to PACS environments is limited to accredited professionals such as physicians, technologists, PACS managers, and clinical scientists. Making data accessible to AI developers is therefore challenging and requires multiple steps, including de-identification of the data.
And even if AI developers do have access to hospitals or practices, they are still limited to the local data, and that lack of diversity causes another problem in AI: datasets may be biased and reflect only part of the target application. For example, a dataset collected as part of a population study may have different characteristics than data from patients referred to a hospital for treatment, where disease incidence is higher. Dataset bias occurs when the data used to build the model has a different distribution than the data on which the model will be applied. With such a mismatch, algorithms that score highly on benchmarks can perform poorly in real-world scenarios.
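One simple, concrete check for this kind of mismatch is to compare label prevalence between the training cohort and a sample of the data the model will actually see. The cohort numbers below are invented for illustration (a population study with 5% disease prevalence versus hospital referrals at 40%).

```python
from collections import Counter

def label_distribution(records):
    """Fraction of each label in a list of {'label': ...} records."""
    counts = Counter(r["label"] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical cohorts: a screening/population study vs. hospital referrals.
population_study = [{"label": "positive"}] * 5 + [{"label": "normal"}] * 95
hospital_referrals = [{"label": "positive"}] * 40 + [{"label": "normal"}] * 60

train_dist = label_distribution(population_study)
deploy_dist = label_distribution(hospital_referrals)

# A large prevalence gap is a red flag that benchmark scores won't transfer.
mismatch = abs(train_dist["positive"] - deploy_dist["positive"])
print(round(mismatch, 2))  # 0.35
```

A gap this large means the model was effectively trained for a different population than the one it will serve, which is exactly the situation where benchmark performance fails to carry over.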
Access to lots of diverse data is not all, though: data quality also matters in machine learning. ML models learn statistical associations from historical data and will only be as good as the data they are trained on. Good-quality data is therefore imperative, a basic building block of any ML pipeline.
Many AI companies and researchers have great expertise in ML, but one of their biggest barriers is ensuring that the algorithms they create are unbiased. As alluded to in this blog, only a few very large businesses have access to a high volume of high-quality, diverse datasets, because achieving that is costly and time-consuming.
If you’re interested in learning more about Gradient Health and how we help businesses with data access, send us a message.