Project Description: Early-phase clinical trials often operate in a “large p, small n” regime, with many covariates and multiple endpoints but small sample sizes. Recent advances in foundation models, such as GPT-style or domain-specific LLMs, may offer new opportunities to extract insights by jointly modeling high-dimensional data and by embedding biomedical context into classic predictive models such as XGBoost. This project will investigate innovative strategies that bridge Generative AI and Predictive Modeling, focusing on, but not limited to, the following directions:
• LLM-guided hypothesis generation (undergraduate level): Prompt domain-specific large language models with trial and literature context, to recommend clinically relevant subgroups and plausible effect modifiers for focused subgroup discovery and enrichment
• Empirical priors for endpoint modeling (graduate level): Identify and adapt joint endpoint distributions from pre-trained medical generative models, to be used as empirical priors
• Synthetic data augmentation (graduate level): Train and validate tabular diffusion models on internal data that preserve observed endpoint relationships, to augment study data with synthesized patient-level records
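As a rough illustration of the empirical-prior direction, the sketch below shows how a prior can stabilize an endpoint estimate when n is small. It is a minimal example under stated assumptions, not the project's method: it uses a textbook normal-normal conjugate update for a single endpoint mean, the function name `shrink_to_prior` is hypothetical, and the prior mean and variance are placeholder numbers standing in for whatever a pre-trained generative model might supply. The observations are made-up values for demonstration only.

```python
from statistics import mean, variance


def shrink_to_prior(y, prior_mean, prior_var):
    """Normal-normal conjugate update for one endpoint mean.

    Assumption: the sampling variance of the mean is approximated by
    plugging in the sample variance (s^2 / n) rather than modeling it.
    """
    n = len(y)
    ybar = mean(y)
    se2 = variance(y) / n  # plug-in sampling variance of the sample mean

    # Weight on the observed data: close to 1 when the prior is vague,
    # close to 0 when the prior is tight relative to the data.
    w = prior_var / (prior_var + se2)

    post_mean = w * ybar + (1.0 - w) * prior_mean
    post_var = 1.0 / (1.0 / prior_var + 1.0 / se2)
    return post_mean, post_var


# Illustrative (made-up) endpoint values from a very small cohort.
observed = [1.0, 2.0, 3.0, 4.0]

# Hypothetical prior; in the project this would come from external sources.
post_mean, post_var = shrink_to_prior(observed, prior_mean=0.0, prior_var=0.5)
print(f"posterior mean = {post_mean:.3f}, posterior variance = {post_var:.3f}")
```

The posterior mean always lies between the prior mean and the sample mean, and the posterior variance is smaller than either source's variance alone, which is the sense in which a well-chosen prior can help in a small-n setting. A joint prior over several endpoints would need a multivariate version of this update.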
Keywords: Generative AI and Diffusion Models; Predictive Modeling of Multiple Endpoints; Clinical Trials; Subgroup Identification
Tools and Skills Students will Use and Learn: Implementing GANs, VAEs, or diffusion models, and training them to synthesize biologically plausible clinical patient data
Preference for Student Profile:
• Machine learning basics
• Conceptual understanding of GenAI frameworks such as GANs, VAEs, and diffusion models
• Proficiency in Python or R
- Fall mentor time: Tuesday: 3:30 PM Eastern
- Fall lab time: Thursday: 3:30 PM Eastern
- Spring mentor time: Tuesday: 3:30 PM Eastern
- Spring lab time: Thursday: 3:30 PM Eastern
- Industry: Pharmaceuticals
- Tools: python, r
- Topics: llms, machine learning, pharmacy, statistical modeling
- Requirements: Open to all students