Clinical Text Classification Project — CT/MRI reports (Data Discovery)

Dec 7, 2020

Background:

Every time you visit the doctor, your medical history and other data are captured by large electronic health record (EHR) systems. This data has considerable potential to yield new insights in precision medicine. However, a major roadblock to repurposing it is that it is mostly unstructured free text rather than well organized and ready to analyze. These records are also filled with healthcare jargon, which makes them a poor fit for current text classification models.

UCSF currently has a pair of classifiers that perform two sequential tasks: 1) identifying reports that can be classified according to the Mayo score (a disease activity score for ulcerative colitis), and 2) assigning a Mayo score to the relevant reports. In this project, I worked with 3 other students and UCSF gastroenterologist Vivek Rudrapatna to extend these techniques to other reports. We intend to answer the question: given the limitations posed by unstructured, jargon-filled data in electronic health records, how can we expand on current clinical techniques to abstract important data for Crohn’s disease using self-supervision +/- supervision, general clinical notes, and CT and MRI reports? In doing so, we hope the models we create can be a step towards unlocking the full information content hidden within EHR systems and accelerating clinical processes and research.

My part of the project focuses on the CT/MRI reports. By carefully reading reports in the database and identifying keywords and patterns for regex-based features, I contributed to cleaning, organizing, and preprocessing the data for further development. The final product is still in progress; after this semester ends, I will continue developing and training a classifier that predicts a label (active vs. inactive disease) from different reports. At the final stage, we hope to publish our results and make our code reproducible so that more members of the public can use it.

The Data:

The data we used comes from UCSF Medical Center’s extensive EHR system. I worked with a dataset of over 13,000 rows for patients at risk of inflammatory bowel disease (IBD). There may be biases in the data depending on how different doctors record patient conditions. The data was largely unstructured and had missing values. To address this, I cleaned the data by parsing the “notes” section (see below) and reformatted it to emphasize the “Impressions” (the summaries of each report).

Example of Parsed CT/MRI reports (“mrn” refers to patient number, “service_date” refers to date and time of report, and “result” is the complete report)
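To make the parsing step concrete, here is a minimal sketch of how the Impression section could be pulled out of a report's free text. It is an illustration, not the project's actual code: the function name `extract_impression`, the section-header format, and the sample report are all assumptions, and real reports may be laid out differently.

```python
import re

# Assumption: the summary section starts with "IMPRESSION:" or "IMPRESSIONS:"
# and runs until the next ALL-CAPS header (e.g. "END OF REPORT:") or the end
# of the text. Real report layouts may differ.
IMPRESSION_RE = re.compile(
    r"(?i:impressions?)\s*:\s*(.*?)(?=\n\s*[A-Z][A-Z /]+:|\Z)",
    re.DOTALL,
)


def extract_impression(report):
    """Return the Impression text of a report, or None if none is found."""
    match = IMPRESSION_RE.search(report)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


# Made-up sample report, for illustration only (not real patient data).
sample_report = (
    "FINDINGS: Mild wall thickening of the terminal ileum.\n"
    "IMPRESSION:\n"
    "1. Active inflammation of the terminal ileum.\n"
    "2. No abscess.\n"
    "END OF REPORT: electronically signed"
)

print(extract_impression(sample_report))
```

Reports without a recognizable Impression header return `None`, which makes missing values easy to count and filter before further processing.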

Example of Cleaned Dataframe. “Impressions” are the primary summary of the report, expanded to 1 impression per row. The columns to the right are indicator variables showing whether various phrases and words common to IBD are present in the Impression. These can be used later for classification.
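The "1 impression per row" layout can be sketched as follows. This is a simplified illustration using plain Python dictionaries rather than a dataframe; it assumes impressions are written as numbered items ("1. ...", "2. ..."), and the helper names `split_impressions` and `expand_rows` are hypothetical.

```python
import re

# Assumption: items are numbered like "1. " or "2) ". A number followed by a
# period and a space inside a sentence would also be treated as a boundary.
ITEM_RE = re.compile(r"(?:^|\s)\d+[.)]\s+")


def split_impressions(impression):
    """Split a numbered Impression block into a list of single impressions."""
    parts = ITEM_RE.split(impression)
    return [part.strip() for part in parts if part.strip()]


def expand_rows(records):
    """Repeat mrn and service_date so that each impression gets its own row."""
    rows = []
    for rec in records:
        for item in split_impressions(rec["impression"]):
            rows.append(
                {
                    "mrn": rec["mrn"],
                    "service_date": rec["service_date"],
                    "impression": item,
                }
            )
    return rows


# Made-up record, for illustration only.
records = [
    {
        "mrn": "0001",
        "service_date": "2020-01-01 09:30",
        "impression": "1. Active inflammation of the terminal ileum.\n2. No abscess.",
    }
]

for row in expand_rows(records):
    print(row)
```

With pandas, a similar result could be obtained by storing the split list in a column and calling `DataFrame.explode` on it.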

Example of regex-based features used to parse notes and create indicator variables in the initial stages of text extraction.
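A rough sketch of what regex-based indicator variables can look like is shown here. The feature names and patterns are illustrative guesses at IBD-related vocabulary, not the actual list used in the project.

```python
import re

# Hypothetical feature patterns; the project's real keyword list is not shown
# in this post. Each pattern matches a word stem so that variants such as
# "inflamed" and "inflammation" are both caught.
FEATURE_PATTERNS = {
    "mentions_inflammation": r"\binflam\w*",
    "mentions_wall_thickening": r"\bwall\s+thicken\w*",
    "mentions_stricture": r"\bstrictur\w*",
    "mentions_fistula": r"\bfistul\w*",
}

COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in FEATURE_PATTERNS.items()
}


def indicator_features(impression):
    """Return a dict of 0/1 flags, one per pattern, for a single impression."""
    return {
        name: int(bool(regex.search(impression)))
        for name, regex in COMPILED.items()
    }


# Note: this only detects that a term is mentioned. It does not handle
# negation, so "No fistula" still sets mentions_fistula to 1.
example = "Active inflammation with bowel wall thickening. No fistula."
print(indicator_features(example))
```

Because simple matching ignores negation, flags like these are best treated as "term is mentioned" features and left for the downstream classifier to interpret in context.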

Our Solution/Model:

We are still designing the classification model, but so far we have created a structured dataframe with features to use for classification. We do not have results yet, but we expect to develop, validate, and deploy a series of text classification models on a variety of textual documents in electronic health record systems. We intend to compare our models to human-level performance and to build a classifier that can distinguish active from inactive disease with roughly 92% accuracy or better. With the finalized work, we hope to save time and resources in healthcare facilities by freeing up human time for other tasks and accurately predicting which EHR files relate to disease, which in turn could help patients be diagnosed and treated more efficiently. In terms of sustainability, the code we have so far may be difficult to follow, so we will continue to document our steps and observations more carefully to make it easier for readers to follow along.

Written by Abraham Niu

This blogpost is for my Data Discovery project presentation.