LLM Leaderboard

The first public LLM comparison based on authentic user prompts and expert human evaluation.

Leaders by category

The leaderboard ranks LLMs across multiple prompt categories.

Categories include Closed QA and Open QA. Win rates (%) against the Guanaco 13B baseline:

Rank  Model              Scores by category (%)
2     WizardLM 13B V1.2  79.56  76.92  73.68  80.31  77.11  92.31
3     LLaMA 2 70B Chat   78.97  88.46  68.42  81.35  69.88  84.62
4     GPT-3.5 Turbo      76.79  73.08  68.42  80.31  73.49  76.92
5     Vicuna 33B V1.3    74.21  82.05  47.37  71.50  70.48  76.92
6     Guanaco 13B        50.00  50.00  50.00  50.00  50.00  50.00
  • Authentic user prompts: our prompts are extracted from real conversations with ChatGPT.
  • Accurate human evaluation: expert human assessments are quality-controlled for best accuracy.
  • Practical comparisons: comparison of the top 5 LLMs by category for business decision-making.

How evaluation works

Toloka compares and ranks the most popular LLMs in multiple categories, using Guanaco 13B as the baseline.

  • If you’re choosing a model for business applications, you want to compare model output on realistic examples. Toloka’s goal is to measure human preferences for LLM output. Our prompts are extracted from real conversations with ChatGPT, and expert human assessments are quality-controlled for best accuracy. The Toloka LLM Leaderboard gives you:
    • Human preferences relevant to downstream applications
    • Comparison of the top 5 LLMs by category for practical business decisions
    • The most accurate human evaluation available
    More details about our evaluation process:
  • Quality control includes 3 stages:
    • Human experts are tested, trained, and certified to perform evaluation tasks with specific guidelines on harmlessness, truthfulness, and helpfulness of model responses.
    • To make ratings objective, we use an overlap of 3 with Dawid-Skene aggregation, so each comparison is evaluated by 3 experts and aggregated to achieve a single verdict.
    • Each expert’s individual accuracy is continually monitored by comparing their judgments with the majority vote.
    Unlike other leaderboards, we do not use crowdsourced ratings or LLM-generated ratings.
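The quality-control stages above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical helpers (`aggregate_votes`, `expert_accuracy`), and it uses a plain majority vote in place of the full Dawid-Skene model, which additionally weights each expert by their estimated reliability:

```python
from collections import Counter

def aggregate_votes(votes):
    """Aggregate the three expert verdicts per comparison into one verdict.
    Simplification: majority vote instead of Dawid-Skene, which would also
    model each expert's error rates."""
    return {task_id: Counter(verdicts).most_common(1)[0][0]
            for task_id, verdicts in votes.items()}

def expert_accuracy(judgments_by_expert, consensus):
    """Monitor each expert by the share of their judgments that match
    the aggregated verdict."""
    accuracy = {}
    for expert, judgments in judgments_by_expert.items():
        matches = sum(1 for task_id, verdict in judgments
                      if consensus[task_id] == verdict)
        accuracy[expert] = matches / len(judgments)
    return accuracy

# Example: two comparisons, each judged by three experts.
consensus = aggregate_votes({"q1": ["A", "A", "B"], "q2": ["B", "B", "B"]})
accuracy = expert_accuracy(
    {"e1": [("q1", "A"), ("q2", "B")],
     "e2": [("q1", "B"), ("q2", "B")]},
    consensus,
)
```

An expert whose accuracy drifts too far below the consensus could then be retrained or removed from the pool.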
  • We collect user prompts written for ChatGPT, run the models on these prompts, and use human evaluation to score the responses. Then we calculate the percentage of prompts where the model scored better than the baseline (Guanaco 13B).
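The scoring step above reduces to a simple win rate over the aggregated verdicts. A minimal sketch, with one assumption: ties are counted as half a win, which is a common convention and is consistent with the baseline scoring exactly 50.00 against itself:

```python
def win_rate(verdicts):
    """Percentage of prompts where the model beat the baseline (Guanaco 13B).

    Each element of `verdicts` is the aggregated outcome for one prompt:
    'model' (model preferred), 'baseline' (baseline preferred), or 'tie'.
    Assumption: ties count as half a win.
    """
    wins = sum(1 for v in verdicts if v == "model")
    ties = sum(1 for v in verdicts if v == "tie")
    return 100.0 * (wins + 0.5 * ties) / len(verdicts)

# Example: 2 wins, 1 loss, 1 tie over 4 prompts -> 62.5%
score = win_rate(["model", "model", "baseline", "tie"])
```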
  • We can develop custom evaluations. Please reach out to us; we'd be happy to discuss your evaluation needs.
  • We select popular models from the Hugging Face Hub.