Benchmarking of Medical LLMs
The evaluation of medical LLMs on the USMLE dataset assesses each model’s performance and accuracy in answering medical questions.
Introduction
This section evaluates the performance of medical LLMs on the USMLE dataset, focusing on each model’s knowledge in addressing medical queries. Serving as a benchmark within this specialized domain, the outcomes of this evaluation offer valuable insights for enhancing language models for critical medical applications.
Objectives
The objectives of the benchmarking are as follows.
- Benchmark the models’ performance on the USMLE dataset.
- Assess the models’ behavior on an out-of-distribution dataset, the US-Mili dataset.
Dataset
The USMLE dataset, a component of MedQA [1], comprises medical questions used for training, validating, and testing models. Each question presents four answer options. The dataset includes 10,178 questions for training, 1,272 for development, and 1,273 for testing, totaling 12,723 questions. A separate set of 100 test records is used only for out-of-distribution testing.
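As a minimal sketch, the splits can be loaded as shown below. This assumes the MedQA-style JSON Lines layout (one question per line with question, options, and answer fields) and uses illustrative file paths rather than the exact paths from the release.

import json

def load_usmle(path: str) -> list[dict]:
    """Load one USMLE split (train/dev/test) from a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            records.append(json.loads(line))
    return records

# Illustrative file names; adjust to the local copy of the MedQA release.
train = load_usmle("US/train.jsonl")  # ~10,178 questions
dev = load_usmle("US/dev.jsonl")      # ~1,272 questions
test = load_usmle("US/test.jsonl")    # ~1,273 questions
print(len(train), len(dev), len(test))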
Model Selection Criteria
In the development of medical applications, selecting the right models is crucial. This section outlines the criteria guiding the identification of suitable models, ensuring the chatbot’s precision and integration within the medical domain. The models considered for benchmarking must meet the following requirements.
- Fine-tuned on medical datasets.
- Model parameters within the range of 7B to 13B.
- Publicly available weights for the model.
- Results and papers cited in reputable research articles.
- Trained on a large dataset and reported against a medical benchmark.
Models for the Benchmarking
The selected models are listed in Table 1 below:
| Models | # Params | Data Scale | Data Source |
| --- | --- | --- | --- |
| ClinicalGPT [2] | 7B | 96k EHRs, 192k med QA, 100k dialogues | MD-EHR, VariousMedQA, MedDialog |
| ChatDoctor [3] | 7B | 110k dialogues | HealthCareMagic |
| MedAlpaca [4] | 7B/13B | 160k medical QA | Medical Meadow |
| AlphaCare [5] | 13B | 52k instructions | MedInstruct-52k |
| Meditron [6] | 7B | 48.1B tokens | Clinical Guidelines, PubMed |
| LLAMA 2 [7] | 7B/13B | 2.0T tokens | A new mix of publicly available online data |
Prompt Format
Each fine-tuned LLM operates with a specific prompt format and generates output accordingly. For benchmarking, the prompt format specified in each model’s paper or repository is therefore used. The prompt-building function for each model is provided below; in these snippets, DEFAULT_SYSTEM_PROMPT is assumed to be defined elsewhere in the notebook. The full implementation of the experiment can be found in the notebook.
ClinicalGPT
def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
Prompt: {system_prompt}
{prompt}
Response:
""".strip()
ChatDoctor
def generate_prompt(Questions: str, options: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
### Instruction:
You are a doctor, please answer the medical MCQS based on the given Question and options A B C D.
### Input:
{Questions}
{options}
### Response:
""".strip()
MedAlpaca
def generate_prompt(Questions: str, options: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
Context: Answer the medical MCQS based on the given Question and options A B C D.
Question: {Questions}
{options}
Answer:
""".strip()
AlphaCare
def generate_prompt(Questions: str, options=DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
Human: Select the one output (a), (b), (c) or (d) that better matches the given instruction.
example:
# Example:
## Instruction:
Give a description of the following job: "ophthalmologist"
## Output (a):
An ophthalmologist is a medical doctor who specializes in the diagnosis and treatment of eye diseases and conditions.
## Output (b):
An ophthalmologist is a medical doctor who pokes and prods at your eyes while asking you to read letters from a chart.
## Which is the better choice, Output (a) or Output (b), or is it a Tie?
Output (a)
Here the answer is Output (a) because it provides a comprehensive and accurate description of the job of an ophthalmologist.
# Task:
Now is the real task, do not explain your answer, just say Output (a) or Output (b).
## Instruction:
You are medical expert answer the following question: {Questions}
## Output (a): {options['A']}
## Output (b): {options['B']}
## Output (c): {options['C']}
## Output (d): {options['D']}
## Which is the best choice, Output (a), Output (b), Output (c) or Output (d)?
Assistant:""".strip()
MEDITRON
def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
### User:
{system_prompt}
{prompt}
### Assistant:
""".strip()
LLAMA 2
def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
Prompt: {system_prompt}
{prompt}
Response:
""".strip()
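As a usage example, the sketch below formats one multiple-choice question with the LLAMA 2 prompt builder above. The sample question and options are placeholders, DEFAULT_SYSTEM_PROMPT is assumed to be defined earlier in the notebook, and the options dictionary is flattened into a single "A. ... B. ..." string as expected by the prompt functions.

# Illustrative usage of the LLAMA 2 prompt builder above; the sample question
# and options are placeholders, not items taken from the dataset.
sample_question = "Which vitamin deficiency causes scurvy?"
sample_options = {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"}

# Flatten the options dictionary into a single "A. ... B. ..." string.
options_text = " ".join(f"{key}. {value}" for key, value in sample_options.items())

print(generate_prompt(sample_question + "\n" + options_text))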
Testing of the models’ performance
The evaluation involved 100 multiple-choice questions (MCQs) on pharmacy-related topics. The models subjected to testing include ClinicalGPT, ChatDoctor, MedAlpaca, AlphaCare, LLAMA 2 7B, LLAMA 2 13B, and Meditron. Each model’s performance was assessed on its ability to accurately answer these domain-specific MCQs within the pharmacy field.
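The sketch below outlines one possible form of the evaluation loop, assuming the models are loaded through Hugging Face transformers; the checkpoint name and generation settings are illustrative, not the exact configuration used in the notebook, and generate_prompt refers to the prompt builder of the model under test.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def answer_mcq(question: str, options_text: str) -> str:
    """Generate the model's raw answer text for one multiple-choice question."""
    prompt = generate_prompt(question + "\n" + options_text)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()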
Evaluation metrics for the models
In evaluating the LLMs on MCQs, accuracy was employed as the primary metric. Answers were treated as binary outcomes, correct or incorrect, determined against the available multiple-choice options; responses that did not select any option were recorded separately as “Not Answered.” This approach measured each model’s success in providing correct responses over the given set of questions.
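Per the Results section, the grading itself was done manually; the sketch below is only an illustration of how the binary scoring could be reproduced automatically, using a simple letter-matching heuristic that is an assumption rather than the method actually applied.

import re

def grade(response: str, correct_letter: str) -> str:
    """Return 'True', 'False', or 'Not Answered' for one model response."""
    match = re.search(r"\b([ABCD])\b", response.upper())
    if match is None:
        return "Not Answered"
    return "True" if match.group(1) == correct_letter.upper() else "False"

def accuracy(grades: list[str]) -> float:
    """Accuracy over all questions; 'Not Answered' counts as incorrect."""
    return sum(g == "True" for g in grades) / len(grades)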
Results
The evaluation of these models was carried out manually, with each response categorized as “True,” “False,” or “Not Answered,” based on the ground truth provided in the dataset. The results are presented in Table 2, which shows the outputs of the LLMs as “True,” “False,” and “Not Answered,” with detailed data available in the accompanying Excel file. Figure 1 illustrates the accuracy of the models on the US-Mili dataset. Notably, LLAMA 2 emerges as the frontrunner, demonstrating superior performance on this out-of-distribution dataset.
| Model Name | TRUE | FALSE | Not Answered |
| --- | --- | --- | --- |
| LLAMA 2 13B | 43 | 46 | 11 |
| LLAMA 2 7B | 32 | 56 | 12 |
| ClinicalGPT | 26 | 74 | 0 |
| MedAlpaca | 33 | 67 | 0 |
| AlphaCare 7B | 23 | 25 | 52 |
| ChatDoctor 7B | 0 | 0 | 100 |
| Meditron | 0 | 0 | 0 |
Table 2: Human-supervised evaluation of the LLMs.
Figure 1: Performance of medical LLMs on the US-Mili dataset.
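The sketch below shows how the accuracy values plotted in Figure 1 can be reproduced from the TRUE counts in Table 2, assuming matplotlib for plotting and treating “Not Answered” responses as incorrect.

# Accuracy is taken as the number of TRUE responses out of the 100
# evaluation questions, so "Not Answered" counts as incorrect.
import matplotlib.pyplot as plt

true_counts = {
    "LLAMA 2 13B": 43,
    "LLAMA 2 7B": 32,
    "ClinicalGPT": 26,
    "MedAlpaca": 33,
    "AlphaCare 7B": 23,
    "ChatDoctor 7B": 0,
    "Meditron": 0,
}
accuracies = {name: correct / 100 for name, correct in true_counts.items()}

plt.bar(list(accuracies.keys()), list(accuracies.values()))
plt.ylabel("Accuracy")
plt.title("Performance of Medical LLMs on the US-Mili Dataset")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()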
Conclusion
In conclusion, the evaluation of medical large language models (LLMs) on the USMLE dataset revealed insights into their performance on medical multiple-choice questions. The models, including LLAMA 2 13B, LLAMA 2 7B, ClinicalGPT, MedAlpaca, AlphaCare 7B, and Meditron, demonstrated varying levels of accuracy. LLAMA 2 13B emerged as the most accurate model, although none achieved a majority of correct responses, emphasizing the inherent challenges in medical question answering. Notably, AlphaCare 7B faced difficulties, marked by a substantial proportion of questions categorized as Not Answered. Additionally, LLAMA 2 13B exhibited a higher capability in handling out-of-distribution data than LLAMA 2 7B, underlining the importance of evaluating model behavior beyond the training domain.
Hamza Farooq
ML Engineer