The drug discovery industry has a data quality problem it mostly prefers not to discuss. Not because there's anything shameful about it — bad data accumulates in every maturing scientific field — but because acknowledging it complicates the story that AI-driven discovery is supposed to tell. Models trained on bad data produce bad predictions. If the training data is systematically flawed, better algorithms won't fix it.

Here's what's actually in the public bioactivity databases, and why it matters for every model that claims to predict binding affinity, selectivity, or ADMET properties.

The Assay Heterogeneity Problem

ChEMBL, the primary public bioactivity database, contains over 18 million activity data points from published literature. That sounds like a rich training set. The problem is that those data points come from thousands of different assay formats, run in hundreds of different laboratories, measured by different methods, against different protein constructs, at different temperatures and pH, using different DMSO concentrations, with different positive controls and different compound handling procedures.

A Ki of 50 nM for compound X against target Y measured in a fluorescence polarization assay is not the same as a Ki of 50 nM measured in a radioligand competition binding assay. They report the same number, but they measure different things. One might report a compound as inactive that the other would call a micromolar hit, depending on the assay's sensitivity to the binding mode.

When you train a model on that pooled data, the model learns a fuzzy average across all those assay conditions simultaneously. For common target classes with thousands of data points across many assays, the noise partially averages out. For less-studied targets where you have fifty data points from three papers using three different assay formats, the model has no principled way to separate the signal from the measurement variability.
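To get a feel for how large this effect is for a given target, it helps to count how many distinct assays contribute to it and how far replicate measurements of the same compound-target pair disagree. Here is a minimal pandas sketch, assuming a ChEMBL-style activity export with columns along the lines of molecule_chembl_id, target_chembl_id, assay_chembl_id, and pchembl_value (adjust the file and column names to whatever your export actually uses):

```python
# Minimal sketch: quantify assay heterogeneity in a ChEMBL-style activity export.
# Assumes columns like ChEMBL's (molecule_chembl_id, target_chembl_id,
# assay_chembl_id, pchembl_value); adjust names to your actual export.
import pandas as pd

acts = pd.read_csv("chembl_activities.csv").dropna(subset=["pchembl_value"])

# How many distinct assays contribute data for each target?
assays_per_target = acts.groupby("target_chembl_id")["assay_chembl_id"].nunique()
print(assays_per_target.describe())

# For compound-target pairs measured in more than one assay, how far apart are
# the reported potencies? The spread, in log units, is a crude estimate of the
# inter-assay noise the pooled model is being asked to fit through.
per_pair = (
    acts.groupby(["molecule_chembl_id", "target_chembl_id"])["pchembl_value"]
    .agg(["count", "min", "max"])
)
replicated = per_pair[per_pair["count"] > 1].copy()
replicated["range_log_units"] = replicated["max"] - replicated["min"]
print(replicated["range_log_units"].describe())
```

Whatever that replicate spread turns out to be, it is effectively the noise floor for any model trained on the pooled data.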

SMILES Errors Are More Common Than You'd Expect

A study published in the Journal of Chemical Information and Modeling in 2023 examined curated subsets of public bioactivity databases and found SMILES representation errors in approximately 2-4% of entries. That might sound small. Across 18 million data points, it represents hundreds of thousands of potentially corrupted training examples. Errors range from simple transcription mistakes, to stereochemistry assignments that are wrong or missing, to records where the compound tested and the compound deposited are not the same structure.
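Some of these errors are cheap to catch before they reach a training set. As a rough illustration (not the curation pipeline from the study above), an RDKit pass can flag SMILES that fail to parse and structures with stereocenters left unassigned; it cannot, of course, catch a stereocenter that is confidently recorded wrong:

```python
# Illustrative structure-level checks with RDKit: parse failures and
# unassigned stereocenters. This catches a slice of the error classes above,
# not all of them.
from rdkit import Chem

def check_smiles(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {"smiles": smiles, "parses": False}
    # Stereocenters present in the molecular graph but unspecified in the record.
    centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    unassigned = [idx for idx, label in centers if label == "?"]
    return {
        "smiles": smiles,
        "parses": True,
        "canonical": Chem.MolToSmiles(mol),
        "unassigned_stereocenters": len(unassigned),
    }

# Example: alanine with and without specified stereochemistry, plus a bad record.
for s in ["CC(N)C(=O)O", "C[C@H](N)C(=O)O", "CC((N)C(=O)O"]:
    print(check_smiles(s))
```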

We've encountered this directly. During validation of our ADMET models, we identified a cluster of training compounds where predicted and measured aqueous solubility differed by more than two log units. Manual inspection found that a subset of those compounds had incorrectly assigned stereocenters in the public data: a chiral center that should have been R was recorded as S, which substantially changes the properties the model learns and predicts.
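The triage that surfaced those records is simple enough to sketch: flag every training compound whose predicted and measured values disagree by more than a threshold, then push the flagged structures to manual review. File and column names below are placeholders, not our actual pipeline:

```python
# Sketch of the residual-based triage described above. File and column names
# are placeholders; the two-log-unit threshold matches the criterion in the text.
import pandas as pd

THRESHOLD_LOG_UNITS = 2.0

df = pd.read_csv("solubility_training_set.csv")  # compound_id, smiles, logS_measured, logS_predicted
df["abs_residual"] = (df["logS_predicted"] - df["logS_measured"]).abs()

flagged = (
    df[df["abs_residual"] > THRESHOLD_LOG_UNITS]
    .sort_values("abs_residual", ascending=False)
)

# Large residuals are not automatically data errors (many are honest model
# failures), but they are where wrong stereochemistry and registration
# mistakes tend to surface first.
flagged.to_csv("flagged_for_manual_review.csv", index=False)
print(f"{len(flagged)} of {len(df)} compounds flagged for review")
```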

Activity Cliffs and Model Failures

Activity cliffs — pairs of structurally similar compounds with large differences in biological activity — are genuinely common in drug discovery SAR. An analog series might show 100-fold potency differences from a single methyl group addition. These are real, meaningful phenomena. They're also exactly the kinds of transitions that machine learning models handle least well, because smooth interpolation between training examples doesn't capture sharp non-linear structure-activity relationships.
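A common way to make this concrete is to mine an SAR set for cliff pairs directly: compounds that are highly similar by fingerprint but far apart in potency. The sketch below uses RDKit Morgan fingerprints with illustrative thresholds (0.8 Tanimoto similarity, a 100-fold potency gap); the exact cutoffs are a choice, not a standard.

```python
# Sketch: find activity-cliff pairs in a small SAR set by pairing high
# structural similarity with a large potency gap. Thresholds are illustrative.
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def cliff_pairs(records, sim_cutoff=0.8, delta_p_cutoff=2.0):
    """records: iterable of (name, smiles, pIC50) tuples."""
    fps = []
    for name, smiles, p in records:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # skip unparseable records rather than crash
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        fps.append((name, fp, p))
    pairs = []
    for (n1, f1, p1), (n2, f2, p2) in combinations(fps, 2):
        sim = DataStructs.TanimotoSimilarity(f1, f2)
        if sim >= sim_cutoff and abs(p1 - p2) >= delta_p_cutoff:
            pairs.append((n1, n2, round(sim, 2), round(abs(p1 - p2), 2)))
    return pairs
```

Checking how many of a model's worst test-set errors land on these pairs is a quick way to see whether a benchmark number survives contact with the interesting chemistry.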

Models that perform well on held-out test sets from the same training distribution often fail dramatically at activity cliffs. The benchmark metrics look fine. The predictions in real programs fail in the exact cases where the chemistry is most interesting.

The practical consequence is that high model performance on published benchmarks is a poor predictor of performance on novel scaffold hops. Every company in this space presents benchmark results. Very few publish prospective validation data — predictions made before the experimental data existed — because that data is unflattering.
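One way to get a benchmark that at least gestures at scaffold hops is to split on Bemis-Murcko scaffolds instead of random rows, so the held-out compounds sit on ring systems the model never saw during training. A minimal RDKit sketch, assuming valid SMILES as input:

```python
# Sketch of a scaffold split: group compounds by Bemis-Murcko scaffold and hold
# out whole scaffolds, so the test set approximates a scaffold hop rather than
# a random resample of the training distribution. Illustrative only.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        # Assumes every SMILES parses; run structure checks first if unsure.
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(i)

    # Largest scaffold families go to train; the remainder, dominated by rarer
    # scaffolds, becomes the test set.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train_target = len(smiles_list) - int(round(test_fraction * len(smiles_list)))
    train_idx, test_idx = [], []
    for idx in ordered:
        if len(train_idx) + len(idx) <= n_train_target:
            train_idx.extend(idx)
        else:
            test_idx.extend(idx)
    return train_idx, test_idx
```

Scaffold-split numbers usually come in noticeably worse than random-split numbers on the same data; that gap is a rough measure of how much of the headline benchmark is interpolation.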

How We Handle It

We run our own experimental data generation as a foundation layer. For any target class where we're building a prediction model, we generate a proprietary dataset using a single standardized assay format in our CRO partner's facility, with consistent compound handling, controlled DMSO concentration, and counter-screens for aggregation and fluorescence interference. That dataset is small compared to ChEMBL — typically 1,000 to 3,000 compounds — but it's internally consistent in a way that public databases are not.

The public data is still useful, but we treat it as a prior rather than ground truth. Models are initialized on the public data, then fine-tuned on the proprietary set. The fine-tuning data dominates for predictions near the chemical space of interest; the public data provides regularization and helps with targets where we have limited proprietary coverage.
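In code, the prior-plus-fine-tuning arrangement is just two training stages. Below is a generic PyTorch sketch of the pattern; placeholder tensors stand in for real featurized compounds, and the architecture and learning rates are illustrative, not our production setup.

```python
# Generic sketch of the prior/fine-tune pattern: pretrain on the large, noisy
# public pool, then continue training on the small, internally consistent set
# at a lower learning rate. Placeholder tensors only; not a production model.
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
loss_fn = nn.MSELoss()

def train(model, X, y, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()

# Stage 1: initialize on the public pool (placeholder tensors here; in practice,
# featurized public compounds with pChEMBL labels, minibatched).
X_pub, y_pub = torch.randn(5000, 2048), torch.randn(5000)
train(model, X_pub, y_pub, lr=1e-3, epochs=20)

# Stage 2: fine-tune on the small, internally consistent proprietary set. The
# lower learning rate keeps the public prior from being overwritten while
# letting the local chemistry dominate near the series of interest.
X_prop, y_prop = torch.randn(2000, 2048), torch.randn(2000)
train(model, X_prop, y_prop, lr=1e-4, epochs=50)
```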

This isn't a complete solution. No amount of data curation fully solves the problem that experimental biology is noisy. But assuming that better algorithms can compensate for systematically flawed data is how you end up with beautiful benchmarks and programs that fail in the wet lab.