Before the compounds currently in our pipeline, there were three that didn't make it. Not compounds we abandoned at preclinical stages because the models flagged problems — those we stopped early, as intended. These three cleared preclinical, passed our own confidence thresholds, and failed after entering clinical development. Writing about this isn't comfortable. It's necessary.

What we learned from those failures shaped the platform we use today more than any academic paper or consultant's recommendation. The patterns across all three failures were instructive in ways that weren't obvious until we could look back at them together.

Candidate One: The Bioavailability Failure

The first compound was a serine protease inhibitor with excellent in vitro potency — enzyme Ki of 4 nM — and clean predicted ADMET profiles. The microsomal stability assay looked fine, the Caco-2 permeability was acceptable, and the rat PK data showed oral bioavailability of approximately 38%. We filed the IND on the basis of that rat data.

In the Phase I study, the compound showed oral bioavailability in humans of under 3% at the proposed doses. The plasma exposures were so low that pharmacological activity was not plausible at any dose in the tested range. The program was stopped after the first cohort.

The retrospective analysis identified the problem: the compound was a substrate for CYP3A4-mediated intestinal first-pass metabolism that our rat PK studies didn't capture, because rats express CYP3A subfamily enzymes in the intestinal wall at levels that differ substantially from humans. The in vitro stability assay had used liver microsomes; catching the intestinal liability would have required intestinal microsomal assays or Caco-2 cells with induced CYP3A expression. We didn't run those assays because nothing in our standard panel flagged the liability.
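
One way to see how a gut-wall extraction gap can dominate the species difference is to decompose oral bioavailability as F = Fa x Fg x Fh: the fraction absorbed, the fraction escaping intestinal metabolism, and the fraction escaping hepatic first pass. The sketch below is illustrative only; the individual fractions are invented to land near the observed 38% and sub-3% figures, not recovered from our data.

```python
# Decomposition of oral bioavailability: F = Fa * Fg * Fh
#   Fa: fraction of the dose absorbed from the gut lumen
#   Fg: fraction escaping gut-wall (intestinal CYP3A) metabolism
#   Fh: fraction escaping hepatic first-pass extraction
# All values below are invented for illustration.

def oral_bioavailability(fa: float, fg: float, fh: float) -> float:
    """Fraction of an oral dose reaching systemic circulation."""
    return fa * fg * fh

# Rat-like scenario: modest gut-wall extraction gives F near the 38%
# we saw in rat PK.
f_rat = oral_bioavailability(fa=0.70, fg=0.80, fh=0.68)

# Human-like scenario: same absorption and hepatic extraction, but heavy
# intestinal CYP3A4 extraction (Fg ~ 0.05) collapses F below 3%.
f_human = oral_bioavailability(fa=0.70, fg=0.05, fh=0.68)

print(f"rat F ~ {f_rat:.2f}, human F ~ {f_human:.2f}")  # rat F ~ 0.38, human F ~ 0.02
```

The decomposition makes the blind spot concrete: a liver-microsome assay interrogates only the hepatic term, so a compound can look stable there and still lose nearly the whole dose at the gut wall.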

We've since added intestinal microsomal stability and Caco-2 flux with CYP3A4 induction as mandatory assays for any compound before clinical candidate nomination.

Candidate Two: The Selectivity Gap

The second failure was a kinase inhibitor whose computational selectivity profile looked acceptable. Our models predicted greater than 100-fold selectivity over the twelve kinases we monitored, which met our criteria for progression. What the models couldn't capture was activity against kinases we hadn't included in the training data for that target class.

A dose-dependent skin rash appeared in the Phase I cohort at doses below the projected therapeutic range. Mechanistically, the rash was consistent with off-target inhibition of an epidermal growth factor receptor family kinase. Retrospective profiling against a 468-kinase panel — something we should have done before clinical entry — found IC50 values below 500 nM against three EGFR family kinases. The computational model had not been trained on those data points and had no way to flag the liability.

The comprehensive kinase selectivity panel now runs from the start of lead optimization rather than at candidate nomination, and the models are retrained as new selectivity data comes in. We've also changed the progression gate: any kinase predicted to fall within 100-fold of the primary target's IC50 must be confirmed by experimental measurement, not by prediction alone.
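
As a concrete illustration of that gate, here's a minimal sketch in Python; the kinase names, the 5 nM primary-target potency, and the predicted IC50s are all hypothetical, chosen only to show the thresholding arithmetic.

```python
# Minimal sketch of the 100-fold rule: any kinase whose predicted IC50
# falls within 100x of the primary target's measured IC50 is routed to
# experimental confirmation rather than trusted as a prediction.
# Kinase names and potencies are hypothetical.

PRIMARY_TARGET_IC50_NM = 5.0   # measured IC50 of the primary target (nM)
SELECTIVITY_WINDOW = 100.0     # the 100-fold threshold

predicted_ic50_nm = {          # model-predicted off-target IC50s (nM)
    "EGFR": 320.0,
    "HER2": 410.0,
    "ABL1": 25_000.0,
    "SRC": 90_000.0,
}

# Everything at or below 5 nM * 100 = 500 nM needs a measured IC50.
needs_measurement = {
    kinase: ic50
    for kinase, ic50 in predicted_ic50_nm.items()
    if ic50 <= PRIMARY_TARGET_IC50_NM * SELECTIVITY_WINDOW
}

for kinase, ic50 in sorted(needs_measurement.items(), key=lambda kv: kv[1]):
    print(f"{kinase}: predicted {ic50:g} nM -> confirm experimentally")
# EGFR: predicted 320 nM -> confirm experimentally
# HER2: predicted 410 nM -> confirm experimentally
```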

Candidate Three: The Translation Problem

The third failure is the one we find most instructive, and the hardest to attribute to a specific modeling error, because the problem wasn't one the models could have predicted: the disease biology in the patient population differed from our preclinical models in a way that eliminated the pharmacological effect entirely.

The compound was an anti-inflammatory small molecule designed to inhibit a cytokine receptor. Our animal efficacy data was strong: two independent rodent models, clear effect sizes, dose-dependent responses. The target biomarker we'd identified in preclinical studies dropped as predicted in the Phase I pharmacodynamic sampling. Everything looked mechanistically consistent.

The Phase II efficacy readout was unambiguously negative. Post-hoc biomarker analysis suggested that the enrolled patient population had high rates of compensatory cytokine pathway upregulation that wasn't present in any of our preclinical models: inhibition of the primary receptor was offset by a parallel signaling pathway within 48 hours of dosing. The preclinical models didn't develop this compensation, apparently because they were acute models that didn't recapitulate the chronic inflammatory state of the patient population.

There was no computational model that could have predicted this. The lesson here was different: patient stratification should have been built into the trial design from the start. Enrolling an unselected patient population with heterogeneous underlying biology and expecting a homogeneous treatment effect was the error. We now require a pre-specified stratification biomarker in every efficacy trial design, even when it's not obvious that patient heterogeneity will matter.

The Aggregate Learning

Looking across the three failures, the pattern is that each one represented a category of prediction error the models weren't designed to catch: species-specific metabolic mechanisms, selectivity gaps outside the monitored kinase panel, and disease biology that doesn't translate from preclinical models to patients. Improving models in the domains they already cover wouldn't have changed any of these outcomes.

What changed our programs was expanding the experimental envelope — more comprehensive panel assays, species-specific PK studies, and earlier integration of clinical biomarker strategy — and building feedback loops so that each failure generates training data that improves the next program. That's less elegant than claiming AI solved the problem. It's more accurate.