🏆 EvalPlus Leaderboard 🏆

EvalPlus evaluates AI Coders with rigorous tests.

📢 News: Beyond correctness, how efficient is their code? Check out 🚀EvalPerf!

GitHub | Paper

📝 Notes

  1. Evaluated using HumanEval+ version 0.1.10; MBPP+ version 0.2.0.
  2. Models are ranked by pass@1 using greedy decoding (see the sketch after these notes). Setup details can be found here.
  3. ✨ marks models evaluated using a chat setting, while others perform direct code completion.
  4. Both MBPP and MBPP+ referred to in our leaderboard use a subset (399 tasks) of hand-verified problems from MBPP-sanitized (427 tasks), to ensure that each programming task is well-formed (e.g., the test_list is not wrong).
  5. Model providers are responsible for avoiding data contamination. Models trained on closed data can be affected by contamination.
  6. 💚 means open weights and open data. 💙 means open weights and open SFT data, but the base model's data is not open. Why does this matter? 💚/💙 models open-source their data, so one can concretely reason about contamination.
  7. "Size" here is the amount of activated model weight during inference.

🤗 More Leaderboards

In addition to the EvalPlus leaderboards, we recommend forming a comprehensive view of LLM coding ability through a diverse set of benchmarks and leaderboards, such as: