Precision immunology requires precision data
After a decade of iterating on AI drug discovery platforms, from Allcyte and Exscientia to Graph, we have learned that precision medicine demands active investment in data, biological context, and repeated experimental validation
We founded Graph as a techbio company in immunology drug discovery knowing that closely and scalably recreating disease biology in the lab is critical if we are to trust any predictive tools built on our data (garbage in, garbage out). This led us to invest heavily in R&D to scale viable primary cells from immune-disease patients as our model system of choice.
Through a decade of evolving AI-driven precision medicine platforms, we have shown that starting with live patient samples can deliver real impact in translational settings. In oncology, our EXALT-1 study showed that when you invest in generating the right biological data, capturing patient heterogeneity, functional responses, and clinically relevant endpoints, you can anticipate real patient responses long before entering clinical studies. In Exscientia's translational research programs we took this further by combining functional assays with multi-omics for positioning and pre-clinical biomarkers, uniquely shaping early development programs.
We think the same holds true for immune-mediated diseases: better (biological) models -> better-fitting data -> more accurate predictions -> faster iteration to the clinic.
The Precision Data Gap: Biological Context Will Determine Success of AI Drug Discovery
The rapid expansion of single-cell datasets, e.g. through massive open data initiatives like the Chan Zuckerberg Initiative's billion-cell project, has fueled the development of biological foundation models and set the community's sights once again on comprehensive virtual cell models. These models promise to reduce our reliance on experimental capabilities (virtual screening), or at least to refine hypotheses ahead of validation.
Nonetheless, the ability of current perturbation prediction models to accurately predict disease biology across diverse patient cohorts remains limited, in large part because they are trained on narrow and often unrelated biological contexts - something inherent to "big data" research, where investment prioritizes immediate scale. Publicly available biological data has significant gaps where systematic innovation has yet to catch up:
Over-reliance on cell lines creates systematic biases that compromise clinical relevance. While cell lines enable reproducible high-throughput screening, they fundamentally cannot capture the genetic diversity, tissue microenvironments, and functional complexity of primary human immune responses in disease states.
Limited perturbation space coverage leaves much biology unexplored. Current large-scale datasets focus heavily on genetic knockdowns and small molecules, while missing disease-relevant stimuli such as pathogen-associated molecular patterns, tissue damage signals, and the complex cytokine combinations encountered in inflammatory diseases. Most studies provide static snapshots rather than longitudinal perturbation responses. Even the most ambitious projects that do take on this complexity quickly run into a combinatorial explosion (the back-of-the-envelope calculation below illustrates the scale), yet the “single large screen” remains at the center of many research efforts.
Missing clinical variability represents perhaps the greatest challenge. Our work in oncology showed that understanding disease heterogeneity between individual patients is critical to translating biological findings. Cell line panels with demographic gaps, limited representation of the disease spectrum, and missing environmental context severely limit generalizability to diverse patient populations. Meanwhile, interventional capability in the clinical context, where that variability is accessible, is understandably very limited. As a result of this data bottleneck, even sophisticated foundation models struggle to generalize to new perturbations in heterogeneous patient cohorts.
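To make that combinatorial explosion concrete, here is a minimal back-of-the-envelope sketch. Every number in it (20 candidate stimuli, 3 doses, 4 timepoints, 50 donors) is a hypothetical assumption chosen purely for illustration, not a description of any specific screen:

```python
# Back-of-the-envelope illustration of how a disease-relevant perturbation space
# explodes combinatorially. All numbers here are hypothetical choices made for
# the sake of the example.
from math import comb

n_stimuli = 20       # candidate stimuli: cytokines, PAMPs, damage signals, ...
doses = 3            # doses per stimulus
timepoints = 4       # sampling timepoints per condition
donors = 50          # donors needed to start capturing patient heterogeneity

for k in (2, 3):     # two- and three-factor stimulus combinations
    combos = comb(n_stimuli, k)                        # e.g. C(20, 3) = 1,140
    conditions = combos * doses ** k * timepoints * donors
    print(f"{k}-factor combinations: {conditions:,} conditions")
# 2-factor combinations: 342,000 conditions
# 3-factor combinations: 6,156,000 conditions
```

Even under these modest assumptions, exhaustively screening three-factor stimulus combinations means millions of conditions per readout, which is why a single large screen rarely covers the disease-relevant space.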
The immunology data gap is particularly acute. Massive oncology datasets have created an assumption that similar datasets exist for every important biological question. This assumption breaks down in precision immunology, where high-quality, contextually rich datasets for complex diseases are few.
To make matters even more complex, the objective of precision immunology is to rebalance how dozens of cell types coordinate through spatial interactions, soluble factor signaling, and dynamic phenotypic transitions. A hepatocyte cell line might inform on basic compound toxicity, but it offers no insight into T cell exhaustion in inflamed joints or macrophage polarization responses, and studying one cell type in isolation from the others ignores the critical interplay that drives disease.
This data scarcity is a challenge, but also a competitive advantage for companies willing to invest in fit-for-purpose, clinically relevant data generation. Many AI drug development companies either focus on validated but hard-to-drug targets, or adopt fast-follower or best-in-class strategies to take advantage of efficiency gains on the chemistry side. The foundational exploration of disease mechanisms has proven harder to automate, but it is no less important: a lack of understanding at this stage compounds through the whole pipeline. This matters even more in immunology, given the immune system's constant, dynamic response to its environment. So how can this critical stage be scaled effectively in I&I?
Lab-in-the-Loop Relies on Validation in Clinically Relevant Contexts
“Lab in the loop” setups have gained traction in recent years, representing a much tighter coupling of analytics and experimental validation than traditional workflows allow. Lab in the loop frames highly multi-dimensional optimization problems (such as finding the perfect antibody, small molecule, or disease target) as an iterative search: instead of committing to large one-shot screens, one assumes a validation budget that should be optimally deployed over a number of successive experimental cycles. Before each cycle, the researcher or an automated system can course-correct based on the outcomes of previous cycles, leading to a gradual enrichment of “good” outcomes.
In one sense, modern science has always been “lab in the loop”, but as an explicit organizing principle it carries a number of important implications: for one, it makes validation failure expected and even desirable, as long as that failure is informative. By extension, it places a strong emphasis on frequent validation.
Applied to the discovery of novel therapeutics, this means striking the right balance between operational simplicity and cycle speed on one side and clinical realism on the other, with the balance itself being another knob to turn. It also means smart integration of prior knowledge (and uncertainty quantification) into the loop: scarce experimental resources should be directed to where they add the most to our understanding of the disease.
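As a concrete illustration of this loop, below is a minimal sketch of budgeted, uncertainty-guided candidate selection over successive experimental cycles. It is not our platform code: the synthetic search space, the run_assay stand-in, the random-forest surrogate, and the upper-confidence-bound scoring are all simplifying assumptions chosen to keep the example self-contained.

```python
# Minimal lab-in-the-loop sketch: a fixed validation budget is spent in small
# batches over several cycles; each cycle the surrogate model proposes the
# candidates it expects to score well and is most uncertain about, the "lab"
# returns measurements, and the model is refit before the next round.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical search space: 5,000 candidate perturbations, 32 features each.
candidates = rng.normal(size=(5000, 32))
true_effect = candidates @ rng.normal(size=32) + rng.normal(scale=0.5, size=5000)

def run_assay(idx):
    """Stand-in for an experimental cycle: returns noisy measurements."""
    return true_effect[idx] + rng.normal(scale=0.5, size=len(idx))

budget_per_cycle, n_cycles = 50, 6
measured_idx = list(rng.choice(len(candidates), budget_per_cycle, replace=False))
measured_y = list(run_assay(np.array(measured_idx)))

for cycle in range(n_cycles):
    surrogate = RandomForestRegressor(n_estimators=200, random_state=cycle)
    surrogate.fit(candidates[measured_idx], measured_y)

    # Upper-confidence-bound acquisition: predicted effect plus disagreement
    # across trees as a crude uncertainty estimate.
    per_tree = np.stack([t.predict(candidates) for t in surrogate.estimators_])
    score = per_tree.mean(axis=0) + per_tree.std(axis=0)
    score[measured_idx] = -np.inf  # don't re-spend budget on measured candidates

    next_idx = np.argsort(score)[-budget_per_cycle:]
    measured_idx.extend(next_idx)
    measured_y.extend(run_assay(next_idx))
    print(f"cycle {cycle}: best measured effect so far = {max(measured_y):.2f}")
```

The important property is not the particular surrogate model but the structure: a fixed validation budget spent in small batches, each batch chosen where predicted effect and uncertainty are highest, and the model refit after every cycle so that each round of results steers the next.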
At Graph, rather than casting the widest possible net, we're building focused, responsive experimental frameworks that can rapidly test, refine, and build upon AI-generated hypotheses in relevant primary patient models.
Data is Strategic Infrastructure
The next generation of breakthroughs in AI drug discovery will come from companies that recognize better algorithms and better biological validation capability as complementary investments. This means funding data generation the way one funds capital expenditure, and understanding that biology needs context.
But more fundamentally, it's about risk management and achieving reliable outcomes. The industry currently operates with two primary approaches to managing the inherent uncertainty in drug discovery. The first is the "YOLO approach" - placing big bets on a few promising targets and hoping for the best, with portfolio-style risk management distributed across multiple companies and investments. While this can work at the venture level, it's an inefficient way to deploy scientific resources.
The alternative is to internalize and systematically address risk through principled approaches that reduce uncertainty at the source. This means building experimental frameworks that can efficiently distinguish between promising and dead-end hypotheses before committing massive resources to development. It means generating data that actually reflects the biological complexity you're trying to solve, rather than hoping simplified models will translate.
This should be more efficient for fundamental reasons: when you reduce biological uncertainty early and systematically, you spend less time and capital pursuing targets that were never going to work. You build institutional knowledge that compounds across programs. You create reproducible processes that can be scaled and improved rather than starting from scratch with each new indication.