MedEvalArena introduction graphic

Introduction

Large Language Models (LLMs) have shown strong performance in medical question answering, but their capabilities in complex clinical reasoning remain difficult to characterize systematically. We present MedEvalArena, a dynamic evaluation framework designed to compare medical reasoning robustness across models using a symmetric, round-robin protocol.

In MedEvalArena, each model generates medical questions intended to challenge the medical reasoning abilities of the other models. Question validity is assessed with an LLM-as-judge paradigm along two axes: logical correctness and medical accuracy. Questions that a majority of the LLM-as-judge ensemble finds valid 'pass' and are used in the exam stage, where each LLM answers all of the valid questions that were generated. The top 6 model families on https://artificialanalysis.ai/ (model cutoff Nov 15, 2025) served both as question generators and as members of the LLM-as-judge ensemble. Each generator produced questions until a quota of 50 valid questions per LLM was reached, for a grand total of 300 questions.
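The sketch below illustrates this generate-judge-exam loop under stated assumptions: `query_model` is a hypothetical wrapper for whatever API client is actually used, the model names and prompts are placeholders, and the majority-vote parsing is simplified. It is not the authors' implementation, only a minimal illustration of the protocol described above.

```python
# Minimal sketch of the MedEvalArena round-robin protocol (assumptions noted above).

GENERATORS = ["model_a", "model_b", "model_c"]  # placeholder model names
JUDGES = GENERATORS                             # same families judge validity
QUOTA = 50                                      # valid questions per generator


def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around the underlying LLM API."""
    raise NotImplementedError


def is_valid(question: str) -> bool:
    """A question passes if a majority of judges finds it valid on both axes:
    logical correctness and medical accuracy."""
    votes = 0
    for judge in JUDGES:
        verdict = query_model(
            judge,
            "Is this question logically correct AND medically accurate? "
            f"Answer yes or no.\n\n{question}",
        )
        votes += verdict.strip().lower().startswith("yes")
    return votes > len(JUDGES) / 2


def build_exam() -> list[str]:
    """Each generator produces candidate questions until it reaches its quota."""
    exam = []
    for generator in GENERATORS:
        kept = 0
        while kept < QUOTA:
            q = query_model(
                generator,
                "Write one challenging medical reasoning question "
                "with a single correct answer.",
            )
            if is_valid(q):
                exam.append(q)
                kept += 1
    return exam


def run_exam(exam: list[str], models: list[str]) -> dict[str, list[str]]:
    """Exam stage: every model answers every valid question."""
    return {m: [query_model(m, q) for q in exam] for m in models}
```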

MedEvalArena provides a dynamic and scalable framework for benchmarking LLM medical reasoning.

Further analyses are available in the preprint: https://www.medrxiv.org/content/10.64898/2026.01.27.26344905v1

Leaderboard

The leaderboard reports four views: question validity (pass rate; shown only for models that served as question generators and in the LLM-as-judge ensemble), accuracy, cost per evaluation, and accuracy vs. cost per evaluation.

Up to the top 10 models by accuracy are shown. Each evaluation contains 300 questions (50 questions generated per LLM).

🏟️ Results

Default sort is by Mean Accuracy (descending). The top-3 entries are marked with 🥇🥈🥉. Validity refers to the pass rate. A sketch of how Mean Accuracy and SEM may be computed follows the table.

Generated: 2026-02-07T16:30:01Z

| # | Model | Mean Accuracy | SEM | Validity |
|---|-------|---------------|-----|----------|
| 1 | claude-opus-4-5-20251101 🥇 | 91.67% | 1.73% | 83.30% |
| 2 | gpt-5.1-2025-11-13 🥈 | 88.67% | 1.24% | 94.80% |
| 3 | deepseek-v3.2 🥉 | 88.33% | 1.92% | 46.00% |
| 4 | grok-4-0709 | 88.00% | 1.82% | 56.00% |
| 5 | gemini-3-pro-preview | 87.00% | 1.87% | 92.30% |
| 6 | grok-4-1-fast-reasoning | 86.33% | 2.17% | – |
| 7 | kimi-k2-thinking | 85.67% | 1.64% | 63.80% |
| 8 | gpt-5-nano-2025-08-07 | 81.00% | 2.32% | – |
| 9 | claude-haiku-4-5-20251001 | 80.33% | 2.27% | – |
| 10 | gemini-2.5-flash-lite | 76.00% | 2.23% | – |
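The source does not specify the exact formula used for SEM; a common choice for a proportion over n questions is the binomial standard error, sketched below under that assumption. The per-question correctness list is a placeholder input.

```python
# Hedged sketch: mean accuracy and its standard error over one 300-question exam.
# The binomial SEM formula is an assumption, not necessarily the formula used here.
import math


def mean_accuracy_and_sem(correct: list[bool]) -> tuple[float, float]:
    n = len(correct)                  # e.g. 300 questions per evaluation
    p = sum(correct) / n              # mean accuracy
    sem = math.sqrt(p * (1 - p) / n)  # standard error of a proportion
    return p, sem


# Example: 265 of 300 correct -> accuracy ≈ 88.3%, SEM ≈ 1.9%
acc, sem = mean_accuracy_and_sem([True] * 265 + [False] * 35)
print(f"accuracy={acc:.2%}, SEM={sem:.2%}")
```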


πŸ“¬ Contact

For questions, please open a GitHub issue on the repository.

Citation

Prem, P., Shidara, K., Kuppa, V., Wheeler, E., Liu, F., Alaa, A., & Bernardo, D. (2026). MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning. medRxiv. https://doi.org/10.64898/2026.01.27.26344905

BibTeX

@article{prem2026medevalarena,
  title   = {MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning},
  author  = {Prem, Preethi and Shidara, Kie and Kuppa, Vikasini and Wheeler, Esm{\'e} and Liu, Feng and Alaa, Ahmed and Bernardo, Danilo},
  journal = {medRxiv},
  year    = {2026},
  date    = {2026-01-27},
  doi     = {10.64898/2026.01.27.26344905},
  url     = {https://doi.org/10.64898/2026.01.27.26344905},
  note    = {Preprint}
}