Podcast: Using large language models to extract PD-L1 testing details from electronic health records


In this podcast, Laura Dormer (Editor, The Evidence Base) speaks with Aaron Cohen (Senior Medical Director, Machine Learning, Flatiron Health) about the potential of large language models (LLMs) to extract details from electronic health records (EHRs) for cancer research. They also delve into the poster ‘Using large language models to extract PD-L1 testing details from electronic health records’, presented at ISPOR 2024 (May 6–8, 2024; Atlanta, GA, USA).

This research focused on using LLMs to extract PD-L1 testing details from EHRs. PD-L1 is a biomarker that helps determine cancer treatment, and extracting details about it from EHRs is challenging due to variations in formatting and reporting.

The research compared two approaches:

  • Zero-shot (without additional training)
  • Fine-tuning (with additional training on specific examples)

Fine-tuned LLMs outperformed both the zero-shot approach and a deep learning model baseline, while requiring far fewer training examples than the baseline. This suggests that LLMs have potential for EHR data extraction, but that high-quality labeled data is crucial for training and for ensuring accuracy.

Future research will explore how well these learnings generalize across different biomarkers and how these LLMs can be operationalized for real-world use.

 

Chapters and timestamps



00:00: Introduction

01:07: What were the main objectives of your study?

03:09: Why were PD-L1 testing details selected as the focus of the research?

04:24: Can you describe the two approaches – zero-shot experiments and fine-tuning – that were used?

05:52: What were the key findings of your study?

10:07: How did the performance of fine-tuned LLMs compare with that of the deep learning model baseline?

11:17: What are the implications of these findings on the future use of LLMs in curating RWD?

12:44: What future research will be needed to take this work forward?

Transcript

LAURA: Hello, everyone. I’m Laura Dormer and I’m the Editor of The Evidence Base. I’m pleased to be joined today by Aaron Cohen from Flatiron Health. Welcome, Aaron. To kick things off, please, could you introduce yourself and tell us a little bit about your role and your background?

AARON: Yeah, happy to. And thanks for the opportunity to be here. My name is Aaron Cohen. I’m a Senior Medical Director at Flatiron Health. I’m a medical oncologist by training. I still practice a day a week. I also have a background in clinical informatics. And at Flatiron, I’m head of clinical data for research in oncology and serve as the clinical lead for our machine learning team. And in that role, most of my interest and focus is in thinking about how we can use new technologies such as natural language processing, machine learning – increasingly, large language models, which we’ll be talking a bit about today – to facilitate research and ultimately benefit patients with cancer.


What were the main objectives of your study?

LAURA: You’ve joined us today to talk about the research that you presented at the recent ISPOR 2024 conference using large language models to extract PD-L1 testing details from electronic health records. Could you start us off by explaining what were the main objectives of your study?

AARON: First, we should take a step back and ask why we want to do a study like this. Because I think there’s a lot of talk about the promise of precision medicine: being able to give the right treatment to the right patient, find the right patient for a clinical trial, make sure that I’m ordering the test that I’m supposed to and not forgetting anything. But in order to do all of that, we need clinical data to guide us. An algorithm isn’t going to be able to match a patient to a lung cancer trial if it doesn’t know that patient has lung cancer. And so many of these details are in the electronic health record, and they’re in the form of free text. A note that I type when I’m seeing a patient – physician charting – just isn’t readily analyzable. And so there’s a lot of value in being able to automatically extract these clinical details from the EHR. And when I say clinical details, I mean things like the diagnosis, like lung cancer, the date that it happened, or the date that a patient progressed.

All of those are hugely important for being able to answer research questions and help at the point of care. But beyond that, having that information in a structured format that can be analyzed is massively important. With all that in mind, and with all the huge advancements that we’ve seen in large language models’ ability to understand and generate text for very complex tasks, we wanted to explore how these powerful tools might be able to do that very thing.

Basically, extract clinical details at scale across the electronic health record or EHR.


Why were PD-L1 testing details selected as the focus of the research?

LAURA: And why were PD-L1 testing details selected as the focus for your research?

AARON: PD-L1 is a very important biomarker. It’s a protein that can be expressed on cancer cells. And it’s what immune checkpoint inhibitors bind to in order to take the brakes off of the immune system so that hopefully the immune system will be able to attack and kill the cancer.

There have been a lot of treatments approved across a lot of different cancer types based on the amount of PD-L1 that’s expressed on cancer cells. And knowing these details about PD-L1 helps guide treatment and management. So it’s important in that way.

But interestingly, the way that PD-L1 results have been reported has changed a lot over the last 10 years. And a lot of that is because our understanding of the science behind it has changed a lot during that time.

And so that combined with the fact that how it can be documented varies by cancer type as well as my earlier point that often it is recorded in an unstructured format – so imagine a PDF document with the information in it – all those things make it a particularly challenging biomarker to extract.


Can you describe the two approaches – zero-shot experiments and fine-tuning – that were used?

LAURA: And can you describe the two approaches – the zero-shot experiments and fine-tuning – that were used in your work?

AARON: Yeah, absolutely. So going back to the goal of what we wanted to understand, which was: can LLMs extract clinical details from the chart? We specifically wanted to evaluate whether commercially available LLMs could do this right away, based on the huge amount of data that they’ve been trained on previously – we’ll call that kind of out-of-the-box approach a zero-shot approach – or whether they first needed to be fine-tuned, which you can think of as basically giving additional training to these large language models using specific labeled examples that show the LLM how you want a specific task to be done. And so we applied these two approaches to two commercially available LLMs, Llama 2 and Mistral, with the goal of extracting seven biomarker details related to PD-L1. And those details ranged from things like dates – so what was the test date or what was the result date? – to the results themselves, like percent staining or staining intensity.
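
To make the zero-shot setup concrete, the sketch below shows how a prompt might ask an off-the-shelf model for structured output. The field names, prompt wording, and the commented-out model call are illustrative assumptions, not the prompts or schema used in the study.

```python
# A minimal sketch of the zero-shot setup, assuming a generic chat-style LLM.
# The field names, prompt wording, and the send_to_llm() helper are illustrative
# assumptions -- they are not the prompts or schema used in the study.
import json

PDL1_FIELDS = [
    "test_date",
    "result_date",
    "percent_staining",
    "staining_intensity",
    # ...the study targeted seven PD-L1 testing details in total
]

def build_zero_shot_prompt(document_text: str) -> str:
    """Ask an off-the-shelf model to return only JSON, with no fine-tuning."""
    return (
        "Extract the following PD-L1 testing details from the clinical document "
        f"below and return ONLY a JSON object with these keys: {PDL1_FIELDS}. "
        "Use null for any detail that is not stated in the document.\n\n"
        f"Document:\n{document_text}"
    )

# Example with a made-up snippet of pathology text:
prompt = build_zero_shot_prompt(
    "PD-L1 (22C3) immunohistochemistry performed 03/14/2021: tumor proportion score 80%."
)
print(prompt)
# response = send_to_llm(prompt)   # hypothetical call to Llama 2 or Mistral
# details = json.loads(response)   # zero-shot outputs often failed to parse as JSON
```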


What were the key findings of your study?

LAURA: What were the key findings of your study?

AARON: For both methods, again, we wanted to make sure that the LLM could pull out the details for PD-L1 results, but we specifically wanted to make sure that they could pull them out in a usable way – in this case, in a JSON format, which is usable for research right out of the gate, easy for humans to understand, and easy for data exchange and analysis. It isn’t really helpful if an LLM pulls out a huge summary of text that contains the answer in it, because you can’t do anything with that text – you’re in the same place that you were with it being in an unstructured format to begin with. And so that was our first goal in asking these LLMs to perform these clinical detail extraction tasks. And what we found was that with the zero-shot method – again, that’s without any fine-tuning on our labeled examples – it just could not do that, despite numerous prompts and exercises. And when I say it couldn’t do that, what I mean is that it frequently output those long summaries of text, which of course aren’t readily usable. And it was also exhibiting hallucinations. I know that term is thrown around a lot when we talk about LLMs, but just to be specific, what we mean there is that the LLMs were providing information on clinical details that weren’t even in the chart. So it was making something up, but it just sounded confident in doing so. Or it would just read back something that we had prompted it with – just repeating back things that were in the initial ask of the LLM.

That’s what we found initially with the zero-shot approach. We didn’t make much progress there. With the fine-tuned method, however, we were able to get output in the right JSON format, and we were able to see good performance in getting the right answers. And just a couple of things to note regarding the results. First, the fine-tuned LLMs were able to extract all seven of these biomarker details at once. And that’s important because it’s just efficient. Finding all the answers that you want related to a biomarker or any clinical concept at once, without needing to ask each individual question, is very valuable.
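
As an aside for readers, here is a minimal sketch of how that JSON output might be checked before it is used downstream. It assumes the model is asked for a flat JSON object with the seven PD-L1 fields; the field names and validation rules are illustrative assumptions, not the study’s actual pipeline.

```python
# A minimal sketch of checking model output before using it downstream,
# assuming a flat JSON object with the seven PD-L1 fields.
# The field names and validation rules are illustrative assumptions.
import json

EXPECTED_FIELDS = {
    "test_date", "result_date", "percent_staining", "staining_intensity",
    # ...plus the remaining PD-L1 details targeted in the study
}

def parse_llm_output(raw: str) -> dict:
    """Reject free-text summaries and flag keys the model invented."""
    try:
        parsed = json.loads(raw)  # long prose summaries fail to parse here
    except json.JSONDecodeError as err:
        raise ValueError(f"Output is not valid JSON: {err}") from err
    if not isinstance(parsed, dict):
        raise ValueError("Output is JSON but not an object of field/value pairs")
    unexpected = set(parsed) - EXPECTED_FIELDS  # keys outside the schema
    if unexpected:
        raise ValueError(f"Unexpected fields returned: {sorted(unexpected)}")
    return parsed

# Example: a well-formed response parses cleanly.
print(parse_llm_output('{"percent_staining": "80%", "test_date": "2021-03-14"}'))
```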

The second thing is that, as I mentioned, performance was good. We looked at metrics – in this case, F1 scores, which you can think of as a balance (the harmonic mean) of sensitivity and positive predictive value. And we saw that the F1 scores were consistently high across all seven of these clinical details, including dates specifically, which can actually be a very challenging detail to extract if you think about all the different dates that show up in a chart and how dates can be referenced in the future or in the past. So that was something that was really nice to see.
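
For readers who want the exact relationship, F1 is the harmonic mean of positive predictive value (precision) and sensitivity (recall); the numbers below are made up purely for illustration.

```python
# F1 is the harmonic mean of positive predictive value (precision) and
# sensitivity (recall); the inputs below are made up for illustration only.
def f1_score(ppv: float, sensitivity: float) -> float:
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)

print(f1_score(0.90, 0.85))  # -> 0.874..., high only when both components are high
```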

And I’d say the last main takeaway, which we think is really important and meaningful, is that it did not take many training examples to have these LLMs learn to do the task well. And just to provide a little bit of context there: for some of our more traditional natural language processing-based deep learning extraction models, we’ve seen that these models require over 10,000 examples or labels to be trained to do something. And in this task with the LLMs and fine-tuning, we were getting good results with as few as 500 to 1,000 – so many, many fewer labels required. And related to that, we also saw that performance improved if the LLM was trained on those same examples multiple times – so showing the examples to the LLM, seeing what the results were, and then showing those same examples again to the LLM, without needing to increase the number of training examples. And what that really highlighted to us is that the model has the ability to learn and refine its skills over time even if you don’t have many labels to work with. Which is important because oftentimes you may not. So I think I would say those are the main key takeaways.
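
To illustrate what training on the same small labeled set for multiple passes can look like in practice, here is a minimal fine-tuning sketch. It assumes a Hugging Face-style causal-LM workflow and uses a small stand-in model so it runs on modest hardware; the example text, field names, and hyperparameters are illustrative assumptions rather than the study’s actual configuration.

```python
# A minimal fine-tuning sketch, assuming a Hugging Face causal-LM workflow.
# The study fine-tuned Llama 2 and Mistral; a small stand-in model is used here
# so the sketch runs on modest hardware (7B models would typically need a GPU
# and parameter-efficient methods such as LoRA). Example text, field names, and
# hyperparameters are illustrative assumptions, not the study's configuration.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"  # stand-in; the study used commercially available LLMs
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each label pairs a document snippet with the desired JSON answer.
# The study saw good performance with roughly 500-1,000 such labels,
# versus >10,000 for the deep learning baseline.
examples = [
    {"text": 'PD-L1 22C3 IHC, 03/14/2021: TPS 80%. ### '
             '{"test_date": "2021-03-14", "percent_staining": "80%"}'},
]

def tokenize(record):
    enc = tok(record["text"], truncation=True, max_length=512)
    enc["labels"] = enc["input_ids"].copy()  # causal-LM loss on the full sequence
    return enc

train_ds = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

args = TrainingArguments(
    output_dir="pdl1-finetune",
    num_train_epochs=3,              # multiple passes over the *same* small labeled set
    per_device_train_batch_size=1,
    learning_rate=2e-5,
)
Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tok).train()
```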


How did the performance of fine-tuned LLMs compare with that of the deep learning model baseline?

LAURA: And how did the performance of the fine-tuned LLMs compare with that of the deep learning model baseline?

AARON: I forgot to mention that one of the things we wanted to do was compare how these LLMs did in extracting PD-L1 details against NLP-based deep learning methods, which, as I referenced earlier, require many more training examples – over 10,000. And so we specifically looked at the two modeling approaches’ ability to extract PD-L1 percent staining – one of the seven clinical details that we were interested in extracting – and compared performance across the two.

And what we saw was that the fine-tuned LLMs outperformed the deep learning model, again, despite this huge disparity in training data. That was very exciting to see and, I think, speaks to the promise of these tools for accelerating data curation for research and point-of-care work moving forward.


What are the implications of these findings on the future use of LLMs in curating RWD?

LAURA: What are the implications of these findings on the future use of LLMs in curating real world data?

AARON: First, the findings show that to truly unlock the potential of these powerful LLMs, at least right now, you really do need high-quality labeled data to help them learn the specific tasks that you want them to learn. Remember that the zero-shot approach, without any fine-tuning, wasn’t really able to do these tasks. And we needed high-quality labeled examples to teach these LLMs what we wanted them to do.

And I’d say the second main takeaway, which is related to the first, is that as new versions of these LLMs are developed and become more and more powerful, it’s just really going to be essential that we thoroughly evaluate how they’re performing and how accurate their answers are, because they can sound really confident.

And as I mentioned before with the hallucinations, they can say something that sounds right but just wasn’t there at all. And it would be hard to know that, because what they say can make so much clinical sense.

So being able to verify and assess performance – again, with access to high-quality labeled data to check against – is going to be really important.


What future research will be needed to take this work forward?

LAURA: And then looking forward, what future research will be needed to move this work on?

AARON: For this study, we trained LLMs specifically to extract PD-L1 biomarker details. And we did that by using PD-L1 labeled examples to teach the LLM to do that. Or to fine-tune it. But what’s going to really be interesting looking forward is to explore how well these models can take their learnings from being trained on one biomarker and extrapolate them to extract details on another.

And so, specifically, what I mean is: if we wanted to extract details on a different biomarker such as EGFR, but we didn’t have any training examples for EGFR, could our having trained the LLM using PD-L1 examples lead it to learn how to do that for EGFR or another biomarker? Will it automatically carry over the learnings from PD-L1 to these other biomarkers and tasks?

And so I think that there’s a lot to learn about how generalizable these learnings end up being for the LLMs, and how many specific training examples you need to have to change tasks for an LLM. And again, we talked about how promising these LLMs are for extracting all this detailed information, all this complex clinical information, and how exciting that is to think about how having access to these details will accelerate research and improve care for patients.

But again, taking a step back, there’s still a lot of work to do to figure out how to actually operationalize all of that. How do you actually take these clinical details and use them to help physicians provide better care when they’re seeing patients? How do you take this information and have it lead to improved enrollment in clinical trials? How do you ensure that potential mistakes that these confident-sounding LLMs make don’t end up leading to incorrect decisions – decisions that could potentially harm patients? And so, again, while all of this is very exciting and promising, and it is the future, there’s still a lot of work to do to understand what this is going to lead to.

LAURA: Perfect. Thank you Aaron for talking to us today. I think this is a really fascinating area, and it will be interesting to see where it goes from here. And thank you to our viewers for joining us on The Evidence Base.

The transcript of the video has been lightly edited for clarity.

Speaker

Aaron Cohen
Senior Medical Director, Machine Learning, Flatiron Health

Aaron is a medical oncologist, clinical informatician, and health outcomes researcher who serves as senior medical director at Flatiron Health. Aaron serves as the Head of Clinical Data for Research Oncology and is the clinical lead for the machine learning team. He focuses on the use of natural language processing, machine learning, and LLMs to facilitate cancer research and care.

Aaron has a faculty appointment at the NYU School of Medicine in the Department of Medicine and maintains a clinical practice at Bellevue Hospital, with a focus on solid organ malignancies. Prior to Flatiron, Aaron attended the University of Pennsylvania, where he completed his undergraduate and medical school training, as well as his Master of Science in Clinical Epidemiology. He completed his residency in internal medicine and fellowship in hematology and oncology at the Hospital of the University of Pennsylvania and is board certified in internal medicine, medical oncology, and clinical informatics.


You may also be interested in:

Video: Evaluation of real-world response rate in clinical trial-aligned cohorts of patients with lung, colorectal, and breast cancer using machine learning

 

 


Disclosures:

The opinions expressed in this feature are those of the interviewee/author and do not necessarily reflect the views of The Evidence Base® or Becaris Publishing Ltd. 


Sponsorship for this interview was provided by Flatiron Health. For more information on Flatiron Health, click here.