Important Info
Please email the required application materials to:
Summer Han, Ph.D. (summer.han@stanford.edu)
Director of the Cancer Data Science Shared Resources Core for the Stanford Cancer Institute
Associate Professor of Medicine, Neurosurgery, and Epidemiology
Quantitative Sciences Unit
Stanford Center for Biomedical Informatics Research (BMIR)
Stanford University School of Medicine
Postdoctoral Fellow in Large Language Models and Electronic Phenotyping in Cancer
We are seeking a highly motivated Postdoctoral Research Fellow with expertise in large language models (LLMs) and electronic phenotyping to join the Cancer Data Science Core at the Stanford Cancer Institute, a team focused on advancing cancer research through data-driven approaches. The Core is directed by Dr. Summer Han, Associate Professor in the Stanford Center for Biomedical Informatics Research (BMIR), and co-directed by Dr. Allison Kurian, Professor in Oncology. The fellow will apply state-of-the-art LLMs to unstructured electronic health record (EHR) data to identify cancer phenotypes, treatment patterns, and disease progression. This work includes integrating LLMs with structured data sources to develop robust computational phenotyping algorithms and scalable models for real-world evidence generation. The role involves both method development and applied research, with opportunities to publish in leading journals, present at top conferences, and contribute to open-source tools. Collaboration with clinicians, data scientists, and machine learning experts will be an essential and enriching component of the position.
Strong candidates will have a background in machine learning and natural language processing (NLP), with a demonstrated ability to work with LLMs and unstructured clinical text from electronic health records (EHRs). Desired technical skills include prompt engineering, few-shot and zero-shot learning, parameter-efficient fine-tuning methods such as LoRA and adapter-based tuning, and retrieval-augmented generation (RAG) approaches. Familiarity with LLM architectures (e.g., GPT, BERT, T5), Transformer-based modeling, and clinical NLP is highly desirable. Experience with cloud platforms such as Google Cloud Platform (GCP) or Microsoft Azure, and with tools like BigQuery, Vertex AI, or Azure ML Studio, is a plus. Candidates should also demonstrate strong skills in Python (for ML/NLP tasks) and R (for statistical modeling or data analysis), as both will be actively used in the research workflow. Importantly, this position requires the ability to engage deeply with clinical free-text data, which is often complex, ambiguous, and domain-specific, in order to develop effective prompts and modeling strategies. The successful candidate should be comfortable working interactively with chart reviewers (e.g., medical students, residents, or fellows) who create ground-truth labels from manual EHR reviews, and should be invested in understanding the clinical context that underlies phenotype definitions and labeling decisions.
Qualifications
- A Ph.D. in biomedical informatics, data science, computer science, statistics, or a related field is required.
- The candidate must have demonstrated proficiency in machine learning, natural language processing, and working with large-scale health datasets.
- Strong programming skills in both Python and R are required.
- A record of peer-reviewed publications, strong written and verbal communication skills, and the ability to work independently and collaboratively within multidisciplinary teams are essential.
Required Application Materials
- A cover letter and a short description of research interests
- CV
- Contact information of three referees