Large Language Models have shown strong performance in medical question answering, but their capabilities in complex clinical reasoning remain difficult to characterize systematically. We present MedEvalArena, a dynamic evaluation framework designed to compare medical reasoning robustness across models using a symmetric, round-robin protocol.
In MedEvalArena, each model generates medical questions intended to challenge the medical reasoning abilities of the other models. Question validity is assessed using an LLM-as-judge paradigm along two axes: logical correctness and medical accuracy. Questions that a majority of the LLM-as-judge ensemble finds valid "pass" and proceed to the exam stage, where every LLM answers all valid questions. The top 6 model families on https://artificialanalysis.ai/ (model cutoff Nov 15, 2025) served as both question generators and members of the LLM-as-judge ensemble. Each generator produced questions until a quota of 50 valid questions was reached, for a grand total of 300 questions.
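The validity check described above (a majority vote of an LLM-as-judge ensemble over two axes) could be sketched as follows. This is a minimal sketch: the `Verdict` structure is hypothetical, and the rule that a judge approves a question only when *both* axes pass is an assumption about how the two axes combine, not something stated in the paper.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """One judge's assessment of a generated question (hypothetical structure)."""
    logically_correct: bool   # axis 1: logical correctness
    medically_accurate: bool  # axis 2: medical accuracy

    def approves(self) -> bool:
        # Assumption: a judge approves only if both axes pass.
        return self.logically_correct and self.medically_accurate

def question_is_valid(verdicts: list[Verdict]) -> bool:
    """A question 'passes' if a strict majority of the judge ensemble approves it."""
    approvals = sum(v.approves() for v in verdicts)
    return approvals > len(verdicts) / 2

# Example: 4 of 6 judges approve on both axes -> the question passes.
ensemble = [Verdict(True, True)] * 4 + [Verdict(True, False), Verdict(False, True)]
print(question_is_valid(ensemble))  # True
```

With a six-judge ensemble as above, "majority" means at least four approvals; a 3–3 split would not pass under a strict-majority rule.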
MedEvalArena provides a dynamic and scalable framework for benchmarking LLM medical reasoning.
Read more analyses here: https://www.medrxiv.org/content/10.64898/2026.01.27.26344905v1
Up to the top 10 models by accuracy are shown. Each evaluation contains 300 questions (50 generated per LLM).
Default sort is by Mean Accuracy (descending). The top-3 entries are marked 🥇🥈🥉. Validity is the pass rate of that model's generated questions.
| # | Model | Mean Accuracy | SEM | Validity |
|---|---|---|---|---|
| 1 | claude-opus-4-5-20251101 🥇 | 91.67% | 1.73% | 83.30% |
| 2 | gpt-5.1-2025-11-13 🥈 | 88.67% | 1.24% | 94.80% |
| 3 | deepseek-v3.2 🥉 | 88.33% | 1.92% | 46.00% |
| 4 | grok-4-0709 | 88.00% | 1.82% | 56.00% |
| 5 | gemini-3-pro-preview | 87.00% | 1.87% | 92.30% |
| 6 | grok-4-1-fast-reasoning | 86.33% | 2.17% | – |
| 7 | kimi-k2-thinking | 85.67% | 1.64% | 63.80% |
| 8 | gpt-5-nano-2025-08-07 | 81.00% | 2.32% | – |
| 9 | claude-haiku-4-5-20251001 | 80.33% | 2.27% | – |
| 10 | gemini-2.5-flash-lite | 76.00% | 2.23% | – |
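One common way to attach an uncertainty to a per-model accuracy is the standard error of a proportion, sqrt(p(1 − p)/n). The sketch below uses that formula with hypothetical numbers; the leaderboard's actual SEM may be computed differently (for example, from variance across the per-generator question subsets), so this is illustrative only.

```python
import math

def binomial_sem(accuracy: float, n_questions: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n).

    Illustrative only; the leaderboard's SEM column may use a different
    estimator (e.g. variance across per-generator question subsets).
    """
    p = accuracy
    return math.sqrt(p * (1.0 - p) / n_questions)

# Hypothetical example: 85% accuracy over 300 questions.
print(f"{binomial_sem(0.85, 300):.2%}")  # → 2.06%
```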
Tip: click a column header to sort.
For questions, please open a GitHub issue on the repository.
Prem, P., Shidara, K., Kuppa, V., Wheeler, E., Liu, F., Alaa, A., & Bernardo, D. (2026). MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning. medRxiv. https://doi.org/10.64898/2026.01.27.26344905
@article{prem2026medevalarena,
title = {MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning},
author = {Prem, Preethi and Shidara, Kie and Kuppa, Vikasini and Wheeler, Esm{\'e} and Liu, Feng and Alaa, Ahmed and Bernardo, Danilo},
journal = {medRxiv},
year = {2026},
date = {2026-01-27},
doi = {10.64898/2026.01.27.26344905},
url = {https://doi.org/10.64898/2026.01.27.26344905},
note = {Preprint}
}