Benchmarking of Medical LLMs
The evaluation of medical LLMs on the USMLE dataset assesses each model’s performance and accuracy in answering medical questions.
Introduction
This section evaluates the performance of medical LLMs on the USMLE dataset, focusing on each model’s knowledge in addressing medical queries. Serving as a benchmark within this specialized domain, the outcomes of this evaluation offer valuable insights for enhancing language models for critical medical applications.
Objectives
The objectives of the benchmarking are as follows.
- Benchmark the models’ performance on the USMLE dataset.
- Assess the models’ behavior on an out-of-distribution dataset, the US-Mili dataset.
Dataset
The USMLE dataset, a component of MedQA [1], comprises medical questions used for training, validating, and testing models. Each question presents four answer options. The dataset includes 10,178 questions for training, 1,272 for development, and 1,273 for testing, totaling 12,723 questions. A separate set of 100 test records is used only for out-of-distribution testing.
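As a minimal sketch, the splits can be loaded as shown below. This assumes the MedQA-style JSON Lines layout (one question per line with question, options, and answer fields) and uses illustrative file paths rather than the exact paths from the release.

import json

def load_usmle(path: str) -> list[dict]:
    """Load one USMLE split (train/dev/test) from a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            records.append(json.loads(line))
    return records

# Illustrative file names; adjust to the local copy of the MedQA release.
train = load_usmle("US/train.jsonl")  # ~10,178 questions
dev = load_usmle("US/dev.jsonl")      # ~1,272 questions
test = load_usmle("US/test.jsonl")    # ~1,273 questions
print(len(train), len(dev), len(test))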
Model Selection Criteria
In the development of medical applications, selecting the right models is crucial. This section outlines the criteria guiding the identification of suitable models, ensuring the chatbot’s precision and integration within the medical domain. The models considered for benchmarking must meet the following requirements.
- Fine-tuned on medical datasets.
- Model parameters within the range of 7B to 13B.
- Publicly available weights for the model.
- Results and papers cited in reputable research articles.
- Trained on a large dataset and reported against a medical benchmark.
Models for the Benchmarking
The selected models are listed in Table 1 below:
| Models | # Params | Data Scale | Data Source |
| --- | --- | --- | --- |
| ClinicalGPT [2] | 7B | 96k EHRs, 192k med QA, 100k dialogues | MD-EHR, VariousMedQA, MedDialog |
| ChatDoctor [3] | 7B | 110k dialogues | HealthCareMagic |
| MedAlpaca [4] | 7B/13B | 160k medical QA | Medical Meadow |
| AlphaCare [5] | 13B | 52k instructions | MedInstruct-52k |
| Meditron [6] | 7B | 48.1B tokens | Clinical Guidelines, PubMed |
| LLAMA 2 [7] | 7B/13B | 2.0T tokens | A new mix of publicly available online data |
Prompt Format
Each fine-tuned LLM operates with a specific prompt format and generates output accordingly. For benchmarking, the prompt format specified in each model’s paper or repository is therefore used. The prompt-building function for each model is provided below; in these snippets, DEFAULT_SYSTEM_PROMPT is assumed to be defined elsewhere in the notebook. The full implementation of the experiment can be found in the notebook.
ClinicalGPT
def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
Prompt: {system_prompt}
{prompt}
Response:
""".strip()
ChatDoctor
def generate_prompt(Questions: str, options: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
### Instruction:
You are a doctor, please answer the medical MCQS based on the given Question and options A B C D.
### Input:
{Questions}
{options}
### Response:
""".strip()
MedAlpaca
def generate_prompt(Questions: str, options: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
Context: Answer the medical MCQS based on the given Question and options A B C D.
Question: {Questions}
{options}
Answer:
""".strip()
AlphaCare
def generate_prompt(Questions: str, options=DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
Human: Select the one output (a), (b), (c) or (d) that better matches the given instruction.
example:
# Example:
## Instruction:
Give a description of the following job: "ophthalmologist"
## Output (a):
An ophthalmologist is a medical doctor who specializes in the diagnosis and treatment of eye diseases and conditions.
## Output (b):
An ophthalmologist is a medical doctor who pokes and prods at your eyes while asking you to read letters from a chart.
## Which is the better choice, Output (a) or Output (b), or is it a Tie?
Output (a)
Here the answer is Output (a) because it provides a comprehensive and accurate description of the job of an ophthalmologist.
# Task:
Now is the real task, do not explain your answer, just say Output (a) or Output (b).
## Instruction:
You are medical expert answer the following question: {Questions}
## Output (a): {options['A']}
## Output (b): {options['B']}
## Output (c): {options['C']}
## Output (d): {options['D']}
## Which is the best choice, Output (a), Output (b), Output (c) or Output (d)?
Assistant:""".strip()
MEDITRON
def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
### User:
{system_prompt}
{prompt}
### Assistant:
""".strip()
LLAMA 2
def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
Prompt: {system_prompt}
{prompt}
Response:
""".strip()
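As a usage example, the sketch below formats one multiple-choice question with the LLAMA 2 prompt builder above. The sample question and options are placeholders, DEFAULT_SYSTEM_PROMPT is assumed to be defined earlier in the notebook, and the options dictionary is flattened into a single "A. ... B. ..." string as expected by the prompt functions.

# Illustrative usage of the LLAMA 2 prompt builder above; the sample question
# and options are placeholders, not items taken from the dataset.
sample_question = "Which vitamin deficiency causes scurvy?"
sample_options = {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"}

# Flatten the options dictionary into a single "A. ... B. ..." string.
options_text = " ".join(f"{key}. {value}" for key, value in sample_options.items())

print(generate_prompt(sample_question + "\n" + options_text))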
Testing of the models’ performance
The evaluation involved 100 multiple-choice questions (MCQs) on pharmacy-related topics. The models subjected to testing include ClinicalGPT, ChatDoctor, MedAlpaca, AlphaCare, LLAMA 2 7B, LLAMA 2 13B, and Meditron. Each model’s performance was assessed on its ability to accurately answer these domain-specific MCQs within the pharmacy field.
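The sketch below outlines one possible form of the evaluation loop, assuming the models are loaded through Hugging Face transformers; the checkpoint name and generation settings are illustrative, not the exact configuration used in the notebook, and generate_prompt refers to the prompt builder of the model under test.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def answer_mcq(question: str, options_text: str) -> str:
    """Generate the model's raw answer text for one multiple-choice question."""
    prompt = generate_prompt(question + "\n" + options_text)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()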
Evaluation metrics for the models
In evaluating the LLMs on MCQs, accuracy was employed as the primary metric. Answers were treated as binary outcomes, correct or incorrect, determined against the available multiple-choice options; responses that did not select any option were recorded separately as “Not Answered.” This approach measured each model’s success in providing correct responses over the given set of questions.
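Per the Results section, the grading itself was done manually; the sketch below is only an illustration of how the binary scoring could be reproduced automatically, using a simple letter-matching heuristic that is an assumption rather than the method actually applied.

import re

def grade(response: str, correct_letter: str) -> str:
    """Return 'True', 'False', or 'Not Answered' for one model response."""
    match = re.search(r"\b([ABCD])\b", response.upper())
    if match is None:
        return "Not Answered"
    return "True" if match.group(1) == correct_letter.upper() else "False"

def accuracy(grades: list[str]) -> float:
    """Accuracy over all questions; 'Not Answered' counts as incorrect."""
    return sum(g == "True" for g in grades) / len(grades)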
Results
The evaluation of these models was carried out manually, with each response categorized as “True,” “False,” or “Not Answered,” based on the ground truth provided in the dataset. The results are presented in Table 2, which shows the outputs of the LLMs as “True,” “False,” and “Not Answered,” with detailed data available in the accompanying Excel file. Figure 1 illustrates the accuracy of the models on the US-Mili dataset. Notably, LLAMA 2 emerges as the frontrunner, demonstrating superior performance on this out-of-distribution dataset.
| Model Name | TRUE | FALSE | Not Answered |
| --- | --- | --- | --- |
| LLAMA 2 13B | 43 | 46 | 11 |
| LLAMA 2 7B | 32 | 56 | 12 |
| ClinicalGPT | 26 | 74 | 0 |
| MedAlpaca | 33 | 67 | 0 |
| AlphaCare 7B | 23 | 25 | 52 |
| ChatDoctor 7B | 0 | 0 | 100 |
| Meditron | 0 | 0 | 0 |
Table 2: Human-supervised evaluation of the LLMs.
Figure 1: Performance of medical LLMs on the US-Mili dataset.
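The sketch below shows how the accuracy values plotted in Figure 1 can be reproduced from the TRUE counts in Table 2, assuming matplotlib for plotting and treating “Not Answered” responses as incorrect.

# Accuracy is taken as the number of TRUE responses out of the 100
# evaluation questions, so "Not Answered" counts as incorrect.
import matplotlib.pyplot as plt

true_counts = {
    "LLAMA 2 13B": 43,
    "LLAMA 2 7B": 32,
    "ClinicalGPT": 26,
    "MedAlpaca": 33,
    "AlphaCare 7B": 23,
    "ChatDoctor 7B": 0,
    "Meditron": 0,
}
accuracies = {name: correct / 100 for name, correct in true_counts.items()}

plt.bar(list(accuracies.keys()), list(accuracies.values()))
plt.ylabel("Accuracy")
plt.title("Performance of Medical LLMs on the US-Mili Dataset")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()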
Conclusion
In conclusion, the evaluation of medical large language models (LLMs) on the USMLE dataset revealed insights into their performance on medical multiple-choice questions. The models, including LLAMA 2 13B, LLAMA 2 7B, ClinicalGPT, MedAlpaca, AlphaCare 7B, and Meditron, demonstrated varying levels of accuracy. LLAMA 2 13B emerged as the most accurate model, although none achieved a majority of correct responses, emphasizing the inherent challenges in medical question answering. Notably, AlphaCare 7B faced difficulties, marked by a substantial proportion of questions categorized as Not Answered. Additionally, LLAMA 2 13B exhibited a higher capability in handling out-of-distribution data than LLAMA 2 7B, underlining the importance of evaluating model behavior beyond the training domain.
Hamza Farooq
ML Engineer