Large Language Models have shown strong performance in medical question answering, but their capabilities in complex clinical reasoning remain difficult to characterize systematically. We present MedEvalArena, a dynamic evaluation framework designed to compare medical reasoning robustness across models using a symmetric, round-robin protocol.
In MedEvalArena, each model generates medical questions intended to challenge the medical reasoning abilities of the other models. Question validity is assessed using an LLM-as-judge paradigm along two axes: logical correctness and medical accuracy. Questions that a majority of the LLM-as-judge ensemble finds valid "pass" and proceed to the exam stage, where every LLM answers all valid questions. The top 6 model families on https://artificialanalysis.ai/ (model cutoff Nov 15, 2025) served as both question generators and members of the LLM-as-judge ensemble. Each generator produced questions until a quota of 50 valid questions was reached, for a grand total of 300 questions.
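The validity check described above (a majority vote of an LLM-as-judge ensemble over two axes) could be sketched as follows. This is a minimal sketch: the `Verdict` structure is hypothetical, and the rule that a judge approves a question only when *both* axes pass is an assumption about how the two axes combine, not something stated in the paper.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """One judge's assessment of a generated question (hypothetical structure)."""
    logically_correct: bool   # axis 1: logical correctness
    medically_accurate: bool  # axis 2: medical accuracy

    def approves(self) -> bool:
        # Assumption: a judge approves only if both axes pass.
        return self.logically_correct and self.medically_accurate

def question_is_valid(verdicts: list[Verdict]) -> bool:
    """A question 'passes' if a strict majority of the judge ensemble approves it."""
    approvals = sum(v.approves() for v in verdicts)
    return approvals > len(verdicts) / 2

# Example: 4 of 6 judges approve on both axes -> the question passes.
ensemble = [Verdict(True, True)] * 4 + [Verdict(True, False), Verdict(False, True)]
print(question_is_valid(ensemble))  # True
```

With a six-judge ensemble as above, "majority" means at least four approvals; a 3–3 split would not pass under a strict-majority rule.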
MedEvalArena provides a dynamic and scalable framework for benchmarking LLM medical reasoning.
Read more analyses here: https://www.medrxiv.org/content/10.64898/2026.01.27.26344905v1
Up to the top 10 models by accuracy are shown. Each evaluation contains 300 questions (50 generated per LLM).
Default sort is by Mean Accuracy (descending). The top-3 entries are marked 🥇🥈🥉. Validity is the pass rate of that model's generated questions.
| # | Model | Mean Accuracy | SEM | Validity |
|---|---|---|---|---|
| 1 | claude-opus-4-5-20251101 🥇 | 91.67% | 1.73% | 83.30% |
| 2 | gpt-5.1-2025-11-13 🥈 | 88.67% | 1.24% | 94.80% |
| 3 | deepseek-v3.2 🥉 | 88.33% | 1.92% | 46.00% |
| 4 | grok-4-0709 | 88.00% | 1.82% | 56.00% |
| 5 | gemini-3-pro-preview | 87.00% | 1.87% | 92.30% |
| 6 | grok-4-1-fast-reasoning | 86.33% | 2.17% | – |
| 7 | kimi-k2-thinking | 85.67% | 1.64% | 63.80% |
| 8 | gpt-5-nano-2025-08-07 | 81.00% | 2.32% | – |
| 9 | claude-haiku-4-5-20251001 | 80.33% | 2.27% | – |
| 10 | gemini-2.5-flash-lite | 76.00% | 2.23% | – |
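One common way to attach an uncertainty to a per-model accuracy is the standard error of a proportion, sqrt(p(1 − p)/n). The sketch below uses that formula with hypothetical numbers; the leaderboard's actual SEM may be computed differently (for example, from variance across the per-generator question subsets), so this is illustrative only.

```python
import math

def binomial_sem(accuracy: float, n_questions: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n).

    Illustrative only; the leaderboard's SEM column may use a different
    estimator (e.g. variance across per-generator question subsets).
    """
    p = accuracy
    return math.sqrt(p * (1.0 - p) / n_questions)

# Hypothetical example: 85% accuracy over 300 questions.
print(f"{binomial_sem(0.85, 300):.2%}")  # → 2.06%
```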
Tip: click a column header to sort.
For questions, please open a GitHub issue on the repository.
Prem, P., Shidara, K., Kuppa, V., Wheeler, E., Liu, F., Alaa, A., & Bernardo, D. (2026). MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning. medRxiv. https://doi.org/10.64898/2026.01.27.26344905
@article{prem2026medevalarena,
title = {MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning},
author = {Prem, Preethi and Shidara, Kie and Kuppa, Vikasini and Wheeler, Esm{\'e} and Liu, Feng and Alaa, Ahmed and Bernardo, Danilo},
journal = {medRxiv},
year = {2026},
date = {2026-01-27},
doi = {10.64898/2026.01.27.26344905},
url = {https://doi.org/10.64898/2026.01.27.26344905},
note = {Preprint}
}