Evaluating LLM reasoning in Russian law.
Lexometrica Ground Truth is an independent LLM leaderboard built on a closed, static dataset of 30 highly complex cases derived from real Russian court practice. We set aside standard memorization metrics to test genuine legal reasoning within the IRAC (Issue, Rule, Application, Conclusion) framework: identifying hidden problems (issue-spotting), applying relevant norms to facts (rule application), and drawing accurate conclusions. The benchmark rigorously evaluates the correctness of the final decision against an expert rubric, the mandatory citation of Russian statutory law, and resilience to the Safety Paradox (over-refusal of legitimate legal queries).
lexometrica-legal-ru-v1 / March 2026
Current Leaderboard.
The Primary Score reflects the core quality of legal reasoning. The Composite Score serves as the final ranking metric, penalizing models for false refusals (Safety Paradox) while rewarding the structural accuracy of legal citations.
| Rank | Provider | Model | Primary Score | Safety Paradox | Citations OK | Composite Score |
|---|---|---|---|---|---|---|
| 1 | OpenAI | GPT-5.4 Pro | 0.90 | 0% | 100% | 0.90 |
| 2 | Anthropic | Claude Opus 4.6 | 0.85 | 0% | 100% | 0.85 |
| 3 | Google | Gemini 3.1 Pro | 0.63 | 0% | 87% | 0.62 |
| 4 | Alibaba | Qwen3.5 Plus 02-15 | 0.60 | 0% | 100% | 0.60 |
| 5 | Z.ai | GLM 5 | 0.57 | 0% | 97% | 0.57 |
| 6 | MoonshotAI | Kimi K2.5 | 0.46 | 0% | 100% | 0.46 |
| 7 | DeepSeek | DeepSeek V3.2 | 0.43 | 0% | 100% | 0.43 |
| 8 | Sber | GigaChat 2 Max | 0.41 | 0% | 90% | 0.40 |
| 9 | MiniMax | MiniMax M2.5 | 0.36 | 0% | 100% | 0.36 |
| 10 | Yandex | YandexGPT Pro 5.1 | 0.23 | 7% | 87% | 0.23 |
Benchmark by cognitive vector.
Each benchmark task is assigned to exactly one cognitive vector. The five dimensions are: rule application (mapping norms to case facts), rule recall (correct citation of the governing norm), rule conclusion (accuracy of the final decision), issue-spotting (identifying hidden legal questions), and interpretation (construing ambiguous norms).
| Model | Rule Application | Rule Recall | Rule Conclusion | Issue Spotting | Interpretation |
|---|---|---|---|---|---|
| GPT-5.4 Pro | 0.80 | 1.00 | 0.98 | 1.00 | 0.75 |
| Claude Opus 4.6 | 0.74 | 1.00 | 0.98 | 0.85 | 0.90 |
| Gemini 3.1 Pro | 0.62 | 0.99 | 0.47 | 0.59 | 0.55 |
| Qwen3.5 Plus 02-15 | 0.62 | 0.25 | 0.85 | 0.48 | 1.00 |
| GLM 5 | 0.47 | 0.75 | 0.61 | 0.58 | 0.90 |
| Kimi K2.5 | 0.42 | 0.30 | 0.75 | 0.43 | 0.00 |
| DeepSeek V3.2 | 0.38 | 0.38 | 0.62 | 0.37 | 0.50 |
| GigaChat 2 Max | 0.45 | 0.33 | 0.42 | 0.40 | 0.20 |
| MiniMax M2.5 | 0.42 | 0.17 | 0.33 | 0.36 | 0.60 |
| YandexGPT Pro 5.1 | 0.22 | 0.45 | 0.17 | 0.13 | 0.60 |
How scores are calculated and weighted.
Primary Score. The baseline metric for legal reasoning quality, calculated as the case-level average of a multi-step logical evaluation that combines manual expert review with strict LLM-as-a-judge assessment.
Safety Paradox. The percentage of cases in which the model falsely triggered internal safety guardrails and refused a legitimate legal query (e.g., "I cannot provide legal advice"). A higher percentage indicates a critical systemic failure in professional settings.
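For illustration only, over-refusal detection can be sketched as simple marker matching; the phrase list and function names below are assumptions, not the benchmark's actual implementation:

```python
# Illustrative sketch: flag responses that refuse a legitimate legal query.
# The marker phrases are assumptions, not the benchmark's real list.
REFUSAL_MARKERS = (
    "i cannot provide legal advice",
    "i am not able to assist with legal questions",
)

def is_refusal(response: str) -> bool:
    """Return True if the response looks like a blanket refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def safety_paradox_rate(responses: list[str]) -> float:
    """Share of cases (0..1) in which the model refused to answer."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```

In practice a judge model rather than string matching would classify refusals, but the aggregation into a per-model rate is the same.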
Citations OK. The percentage of responses containing structurally correct and verifiable citations of Russian legal norms, checked for precise references to Codes, Articles, and Federal Laws.
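As a rough sketch, the structural part of this check could be approximated with patterns like the following; the regexes are a simplified assumption (the benchmark's actual verifier is not published):

```python
import re

# Simplified patterns (assumptions): an article reference such as
# "ст. 432 ГК РФ", or a federal law number such as "№ 44-ФЗ".
ARTICLE_RE = re.compile(r"ст\.?\s*\d+(\.\d+)?\s+[А-ЯЁ]{2,4}\s+РФ")
FEDERAL_LAW_RE = re.compile(r"№\s*\d+-ФЗ")

def has_structural_citation(answer: str) -> bool:
    """True if the answer contains at least one well-formed citation."""
    return bool(ARTICLE_RE.search(answer) or FEDERAL_LAW_RE.search(answer))
```

Verifiability (whether the cited article actually exists and says what the model claims) would require a lookup against the statutory texts, which this structural check does not cover.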
Composite Score. The definitive leaderboard metric, calculated with the following formula:
Primary Score × (1 − 0.2 × Safety Paradox) × (0.85 + 0.15 × Citations OK)
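A minimal sketch of the formula in code (the function name is ours; inputs are the leaderboard percentages expressed as fractions in [0, 1]):

```python
def composite_score(primary: float, safety_paradox: float, citations_ok: float) -> float:
    """Composite = Primary × (1 − 0.2 × SafetyParadox) × (0.85 + 0.15 × CitationsOK).

    Safety Paradox and Citations OK are the leaderboard percentages
    divided by 100.
    """
    return primary * (1 - 0.2 * safety_paradox) * (0.85 + 0.15 * citations_ok)

# E.g. Gemini 3.1 Pro: 0.63 * 1.0 * (0.85 + 0.15 * 0.87) ≈ 0.62,
# matching its published Composite Score.
```

Note that a perfect citation rate leaves the Primary Score unchanged (the citation factor equals 1.0), while refusals and missing citations both pull the composite down multiplicatively.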