Lessons Learned in Regulated Medical AI Development

March 23, 2026 · 8 min read · Kevin Davis
#machine learning

Background

Following our regulatory approval of another AI product, I thought I’d write down a set of lessons learned while they’re still fresh in my mind. These have to be kept somewhat high level for reasons that should be obvious, but hopefully they’ll prove useful to someone else who is just beginning to embark on an effort to improve medical outcomes using artificial intelligence.

The following sections can be read independently:

Assist, Don’t Replace, Human Experts

Clinicians are increasingly overwhelmed by a data deluge, which presents prime opportunities for AI to deliver significant time savings. Every hour they spend reviewing electronically transmitted cardiac electrograms is an hour taken away from treating patients.

However, any AI product that makes clinical assessments independent of human oversight is going to be heavily (and rightly) scrutinized by regulators such as the FDA or TÜV. Perfect sensitivity is an unrealistic performance goal for most real-world applications, which means that occasional false negatives will occur, and an important piece of medical information will not reach human eyes. Regulators are understandably concerned by this, no matter how overwhelming the net benefits the product may otherwise offer. To mitigate these concerns, bypass mechanisms may need to be introduced, whereby the AI is overruled if a certain amount of time has passed since the last communication or the data exhibits certain characteristics.

| Acceptable Design | Better Design |
| --- | --- |
| Remove AI-adjudicated false clinical events from human oversight. | Segregate AI-adjudicated false events, while maintaining the option for human review. |
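The bypass idea can be sketched as a simple routing rule. This is a hypothetical illustration, not the actual product logic; the function name and the thresholds (`MAX_SILENCE`, `MAX_EPISODE_SECONDS`) are invented for the example:

```python
from datetime import datetime, timedelta

# Illustrative thresholds only -- real values would come from clinical input.
MAX_SILENCE = timedelta(days=30)
MAX_EPISODE_SECONDS = 60

def needs_human_review(ai_says_false, last_review_time, now, episode_seconds):
    """Return True if the event should be surfaced to a clinician."""
    if not ai_says_false:
        return True   # true events always go to humans
    if now - last_review_time > MAX_SILENCE:
        return True   # bypass: too long since the last communication
    if episode_seconds > MAX_EPISODE_SECONDS:
        return True   # bypass: episode characteristics exceed a limit
    return False      # segregate, but keep available for optional review
```

Events that return `False` here would be segregated rather than deleted, preserving the option for human review.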

Don’t Overly Focus on Model Architecture

When I implemented our initial atrial fibrillation (AF) classification model a few years ago, I quickly hit a ceiling where model performance approached, but never surpassed, the level of human agreement. This makes sense: if three human expert labelers only agree 80% of the time, then a model trained to predict their decisions will likewise struggle on that 20% of ambiguous edge cases.

Nevertheless, I persisted much longer than I should have in looking for other creative ways to improve model performance through various architectural tweaks. When I handed off my work to a much smarter data science PhD who later joined the team, she made several further refinements, but in the end, the final model’s performance was very close to the original’s. We even tried outsourcing the challenge to a company-wide Hackathon and got similar results.

Fast-forward a couple of years to the advent of state-of-the-art open source models like ECGFounder, and the classification performance remains almost unchanged despite the new model being over 20X the size.

As another anecdote, I trained a convolutional neural network (CNN) to detect cardiac lead noise a few years ago as a side project. In hindsight, I architected the model “upside down”, with wide kernels leading to narrower ones rather than the standard opposite approach. Yet despite my upside-downness, the model reached amazing levels of accuracy.

Stick to Best Practices

A common, almost universal, norm in machine learning is to standardize model inputs. In the context of a cardiac time series signal or a 2D medical image, this usually means adjusting the input to have a mean of 0 and a standard deviation of 1. This helps to avoid saturating the model’s activation functions, which can lead to vanishing gradients.
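A minimal version of this standardization step, assuming a 1D numpy signal, with an epsilon guard against division by zero on flat segments:

```python
import numpy as np

def standardize(signal, eps=1e-8):
    """Scale a 1D input signal to zero mean and unit standard deviation.

    eps guards against division by zero on flat (constant) segments.
    """
    signal = np.asarray(signal, dtype=np.float64)
    return (signal - signal.mean()) / (signal.std() + eps)
```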

Yet as we discovered by accident, models will sometimes perform noticeably better on certain input types if this standardization step is skipped. This intuitively makes sense for certain cardiac signals with low-amplitude R-waves, but in hindsight, skipping it is not a good idea.

The problem is, even if a model performs a bit better on paper with non-standardized data, it will likely be “highly opinionated”—meaning the tiniest signal change can lead to large swings in the sigmoid classifier output. If you find yourself with a model where a very low threshold of 0.0000023 provides optimal binary classification performance, then it may be time to rethink your approach. This will save you future headaches when it comes to real-world performance monitoring following the product release.
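A trivial guardrail along these lines is to flag any "optimal" threshold that sits at an extreme of the sigmoid output range. The cutoff values here are illustrative, not recommendations:

```python
def threshold_smells_bad(threshold, low=1e-3, high=1 - 1e-3):
    """Flag an 'optimal' decision threshold at the extremes of [0, 1].

    Such thresholds hint that the classifier is 'highly opinionated' and
    that tiny input changes may swing the output wildly.
    """
    return threshold < low or threshold > high
```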

Focus Labeling Effort on Uncertain Classifications

Labeling medical data is usually expensive and time-consuming. Even if you increase your dataset size by 10X, your model performance may improve only slightly.

Instead, we were able to considerably improve our model performance by identifying the boundary cases where the model was less certain in its classifications, and then labeling them and adding them to the dataset.

For example, if you run your binary classification model on 1 million real-world inputs, it will likely be quite confident in its labels for most of them, i.e. it will generate scores very close to 0 or 1. It's the less certain ones, with scores between, say, 0.1 and 0.9, that yield the most benefit when added to the dataset. These will likely require considerable human effort to label (after all, if a mature model struggles with them, humans probably will too), but a surprisingly small addition can lead to marked improvement.
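The selection step itself is only a few lines; the band edges below are the same illustrative values as above, not tuned constants:

```python
import numpy as np

def select_for_labeling(scores, low=0.1, high=0.9):
    """Return indices of model outputs in the uncertain band (low, high).

    Scores near 0 or 1 are confident; the boundary cases in between are
    the ones worth spending labeling budget on.
    """
    scores = np.asarray(scores)
    return np.flatnonzero((scores > low) & (scores < high))
```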

Beware of “Frequent Flyers” in Datasets

Medical data often follows a Pareto distribution, where most clinical events come from a small minority of patients.

If you just randomly sample from these clinical events to create your training set, it’s very likely that a few patients will be heavily overrepresented. Even worse, if some of these same patients’ data ends up in your validation set, you can get a misleadingly rosy picture of model performance.

When creating datasets, first group the examples by patient, then ensure that no more than N samples are chosen from each. Furthermore, as your datasets continue to grow and evolve, it can be very easy for “leakage” to occur, where data from the same patient might find its way into both training and evaluation sets. You’ll save yourself lots of headaches towards the end of the project if you have a process and scripts in place to automatically detect and flag these patients.
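Both steps can be sketched briefly, assuming examples arrive as (patient_id, sample) pairs; the helper names are invented for illustration:

```python
from collections import Counter

def cap_per_patient(examples, max_per_patient):
    """Keep at most max_per_patient examples for each patient ID.

    examples: iterable of (patient_id, sample) pairs.
    """
    counts = Counter()
    kept = []
    for patient_id, sample in examples:
        if counts[patient_id] < max_per_patient:
            counts[patient_id] += 1
            kept.append((patient_id, sample))
    return kept

def find_leakage(train_ids, eval_ids):
    """Return patient IDs appearing in both training and evaluation sets."""
    return sorted(set(train_ids) & set(eval_ids))
```

Running `find_leakage` as an automated check on every dataset revision is the kind of process that pays for itself by the end of the project.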

Control for Lopsided Datasets in Loss Function

Real-world medical datasets will often be skewed in their numbers of samples from each category. In a simple binary case, there might be far more “true” than “false” detections, or vice versa. You can control for this somewhat in your dataset selection by maintaining a ratio of, say, no more than 2-3 “major” examples for every “minor” example, but it’s often necessary to adjust the loss function itself. For example, in cases where it’s critical to maintain classification accuracy on minority examples, weighting the loss per class or using a specialized loss function such as focal loss should be considered.
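For reference, binary focal loss (Lin et al.) can be written out in a few lines of numpy; gamma and alpha below are the commonly cited defaults, not tuned values:

```python
import numpy as np

def binary_focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights easy, confident examples so the
    hard (often minority-class) examples contribute more to the gradient.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    p = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)          # prob of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

The `(1 - p_t) ** gamma` term is the key design choice: a confident correct prediction contributes almost nothing, while a hard example keeps nearly its full cross-entropy weight.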

Augmentation Can Work Surprisingly Well

Many time series medical signal recordings, such as cardiac electrograms, can be augmented considerably without affecting their classification. For example, if a recording shows evidence of atrial fibrillation (lack of p-waves; R-R interval irregularity), then randomly stretching or shrinking that same signal by up to about 20% probably won’t change its label. It can similarly be flipped, mimicking real-world cases of electrode inversion, or shifted by a random number of seconds. Adding low-level Gaussian noise is also surprisingly effective.
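These augmentations can be sketched for a 1D signal as follows; the parameter defaults are illustrative, and the stretch uses simple linear interpolation rather than proper band-limited resampling:

```python
import numpy as np

def augment(signal, rng, max_stretch=0.2, noise_std=0.01):
    """Apply label-preserving augmentations to a 1D cardiac signal (sketch).

    - resample to a random length within +/- max_stretch (stretch/shrink)
    - random polarity flip (mimics electrode inversion)
    - random circular time shift
    - low-level Gaussian noise
    """
    signal = np.asarray(signal, dtype=np.float64)
    n = len(signal)
    factor = 1.0 + rng.uniform(-max_stretch, max_stretch)
    new_n = max(2, int(round(n * factor)))
    # linear-interpolation resampling implements the stretch/shrink
    out = np.interp(np.linspace(0, n - 1, new_n), np.arange(n), signal)
    if rng.random() < 0.5:
        out = -out                              # electrode inversion
    out = np.roll(out, rng.integers(0, new_n))  # random time shift
    return out + rng.normal(0.0, noise_std, size=new_n)
```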

Invest in your Processes Early On

Creating medical AI products requires biomedical and data science expertise, and most such experts are not software engineers. Many are unfamiliar with Git, and those who are familiar with it are often hesitant to use it for anything other than basic staging and commits. And yet, things like traceability and reproducibility are paramount. If you wait until the business decides to launch your research work as a product before you formalize your processes, you may be in for a lot of extra work.

I made the mistake of just verbally suggesting processes in passing, without proactively following through to make sure that everyone was indeed on the same page. It wasn’t until months later that I discovered that, while most of the team thought they were following the agreed-upon process, they actually weren’t. For example, the data science code contained a sea of commented-out variables, one for each prior experiment. This made traceability especially difficult, and it took us quite a lot of late rework to get our ducks in a row.

Demonstrating Effectiveness

When it comes to assessing model performance, regulators aren’t going to care about how great your F1 score is. (Probably nobody else will either, come to think of it.) You need to represent your model performance in terms that clinicians are familiar with: sensitivity, specificity, PPV, NPV, and last (and probably least), accuracy. You are going to need to show that the model performs acceptably across different patient demographics including age, sex, race, and geographic location. It’s still possible to successfully build a product in the absence of such identifying characteristics, but it will require particularly compelling evidence with sufficient statistical power.

If you’re unsure whether a regulator will accept your submission, sending them a “pre-sub” list of questions can be quite helpful for reducing risk. Had we not done so, we might have spent months proceeding down the wrong regulatory pathway.
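Deriving those clinician-facing metrics from a binary confusion matrix is straightforward; a small helper along these lines keeps the definitions in one place:

```python
def clinical_metrics(tp, fp, tn, fn):
    """Translate a binary confusion matrix into the terms clinicians use."""
    return {
        "sensitivity": tp / (tp + fn),   # a.k.a. recall, true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value (precision)
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```

In practice these would also be reported with confidence intervals, and stratified by the demographic subgroups mentioned above.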