So you want to build a medical algorithm? Maybe you’re looking for skin images of diverse populations. Perhaps you’re looking for COVID-19 lung nodule CT scans or subarachnoid hemorrhages. While there has been a lot of excitement surrounding big data and the possibilities it opens up for natural language processing (NLP) and computer vision, medical AI has lagged behind because it lacks the foundational input for algorithms: large, labeled datasets.
Large, labeled datasets for health applications are especially hard to access online because of privacy concerns and differing IT systems [1]. In addition, medical datasets can cost millions of dollars to acquire or create, as much of the research for medical algorithms is funded by big private corporations that don’t share large portions of their data with the public [2]. Even huge research institutions struggle to provide large medical datasets to their academic research teams, despite having access to teaching hospitals and institutional support. Hoping to spur crowd-sourced AI applications in health care, Stanford’s AIMI center is expanding its free repository of datasets for researchers around the world. Stanford HAI’s AIMI has teamed up with Microsoft’s AI for Health program, aiming to build a research ecosystem and democratize medical datasets [3].
Also, while medical imagery exists for certain pathologies, background noise from other medical conditions and lifestyles can confound medical algorithms that are highly accurate on small sets of predetermined data but poorly aligned with real-world applications. This becomes a key logistical challenge in medical AI development [4].
What are medical images?
A medical image is any image that pertains to assessing a pathology. These include medical illustrations and anatomical images. Medical imaging from CT, MRI and PET is typically stored in DICOM format, the international standard for saving medical imaging data. What makes medical images uniquely difficult to source? Privacy: health data tells a person’s story so specifically that all personally identifying information needs to be scrubbed, creating major confidentiality concerns for both the hosting organization and the patient [5]. While organizations like the National Academy of Medicine continue to advocate for a “learning healthcare system” that produces constantly updated reference data during the care process, openly sharing healthcare data is still a long way off [6].
“The reason Gradient Health exists is because not having access to large datasets without institutional backing sets back humanity’s future,” Ouwen Huang, CSO of Gradient, said. “You won’t have data access at even large research institutions; access to data plagues all institutions regardless of size.” This, of course, limits the size of the open-source medical datasets available online.
How to use this Guide:
In this guide, we wrote a preamble to the directory to illustrate the challenges behind open-source datasets. We collected a list of DICOM, pathology images, and statistics to give you a head start on your data science and AI projects. Use the CSV file below to access current open-source medical datasets by name, organizer, modality and organ target. Get in contact with us @gradienthealth when you find more open-source datasets and we’ll share them here!
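If you would rather search the directory in code than by eye, here is a minimal sketch of filtering it by modality and organ target using Python's standard `csv` module. The column headers (`name`, `organizer`, `modality`, `organ`) and every row in the sample are assumptions made up for illustration; check the actual CSV for its real headers and adjust accordingly.

```python
import csv
import io

# Placeholder rows standing in for the real directory file. The column
# names and every value here are hypothetical, for illustration only.
SAMPLE_CSV = """name,organizer,modality,organ
Example Chest Set,Example University,CT,Lung
Example Skin Set,Example Institute,Photograph,Skin
Example Brain Set,Example Hospital,MRI,Brain
Example Lung X-rays,Example Lab,X-ray,Lung
"""


def filter_datasets(csv_file, modality=None, organ=None):
    """Return rows matching the given modality and/or organ (case-insensitive).

    A filter left as None is not applied.
    """
    matches = []
    for row in csv.DictReader(csv_file):
        if modality and row["modality"].strip().lower() != modality.lower():
            continue
        if organ and row["organ"].strip().lower() != organ.lower():
            continue
        matches.append(row)
    return matches


# Find every lung CT dataset in the sample.
lung_ct = filter_datasets(io.StringIO(SAMPLE_CSV), modality="ct", organ="lung")
for row in lung_ct:
    print(row["name"], "-", row["organizer"])
```

To run this against a downloaded copy of the directory, pass an open file handle instead of the in-memory sample, for example `open("datasets.csv", newline="")`, where `datasets.csv` is a hypothetical filename.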