Toloka's new LLM Leaderboard: Finding the best model for your business
Imagine you're building a chatbot to answer investment questions for banking customers. Before you can start fine-tuning a Large Language Model, you need to research and decide: which model is best suited to drive your AI tool?
To create your shortlist, you'll probably compare benchmarks like model size, inference options, latency, and costs. But ultimately you want to know which large language models will meet your expectations and serve your customers responsibly. You certainly don't want your investment chatbot to mislead anyone with murky financial advice, or risk sounding unfriendly or annoying. How can you effectively compare model performance on relevant business topics?
Toloka's new LLM Leaderboard is an excellent starting point. Our open LLM ranking zeroes in on what's essential for reliable and relevant model comparisons.
We compare the 5 most popular large language models for maximum efficiency: WizardLM 13B V1.2, LLaMA 2 70B Chat, Vicuna 33B V1.3, GPT-4, and GPT-3.5 Turbo
We use unique organic prompts to evaluate the models in different fields: brainstorming, closed Q&A, open Q&A, text generation, and rewriting
We use human experts — the gold standard — to accurately assess and rate the model output
Curious about the rankings? Check out the Toloka LLM Leaderboard here:
Keep reading to learn how we rank models and why it's important.
The evaluation method for large language models
Our model evaluation process has two stages: prompt collection and human evaluation.
Stage 1: Prompt collection
We collected our own dataset of high-quality prompts to use for LLM evaluation.
Other evaluation methods rely on open-source resources, but these are not reliable enough for accurate evaluation. Using open-source datasets can be restrictive for several reasons:
Many open LLM prompts are too generic and do not reflect the needs of a business looking to implement an LLM.
The range of tasks the open-source prompts cover might be broad, but the distribution is skewed towards certain topics that are not necessarily the most relevant for downstream applications.
It is virtually impossible to guarantee that the dataset was not leaked and the open-source prompts were not included in the training data of the existing LLMs.
To mitigate these issues, we collected original prompts sent to ChatGPT (some were submitted by Toloka employees, and some we found on the internet, but all of them were from real conversations with ChatGPT).
In these examples, you can see how the large language models respond differently to the same prompts.
Prompt: What does 'burning the midnight oil' mean?
Prompt: What is 5x5?
Prompt: What is an espresso tonic? What does it consist of?
These prompts serve as the cornerstone to accurate model evaluation — we can be certain that the prompts represent real-world use cases, and they were not used in any LLM training sets. We store the dataset securely and reserve it solely for use in this particular evaluation.
After collecting the prompts, we manually classified them by category and got the following distribution:
We intentionally excluded prompts about coding. If you are interested in comparing coding abilities, you can refer to specific benchmarks such as HumanEval.
Stage 2: Human evaluation
Human evaluation of prompts was conducted by Toloka's domain experts.
Our experts were given a prompt and responses to this prompt from two different models: the reference model (Guanaco 13B) and the model under evaluation. In a side-by-side comparison, experts selected the best output according to the harmlessness, truthfulness, and helpfulness principles.
In other words, each model was compared to the same baseline model, rather than comparing each model to every other competitor model. Then we calculated the percentage of prompts where humans preferred the tested model's output over the baseline model's output (this is called the model's win rate). The leaderboard shows results in each category, as well as the average score overall for each of the tested models.
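The win-rate calculation described above can be sketched in a few lines. This is a minimal illustration, not Toloka's actual pipeline code; the function name and data layout are assumptions for the example.

```python
def win_rate(preferences):
    """Fraction of prompts where experts preferred the tested model's
    output over the baseline (Guanaco 13B) output.

    `preferences` is a list of booleans, one per prompt: True means the
    tested model's response won the side-by-side comparison.
    """
    if not preferences:
        raise ValueError("no judgments to aggregate")
    return sum(preferences) / len(preferences)

# Example: the tested model wins 7 of 10 side-by-side comparisons.
judgments = [True, True, False, True, True, False, True, True, False, True]
print(f"{win_rate(judgments):.0%}")  # → 70%
```

A win rate above 50% means humans preferred the tested model over the baseline more often than not; the leaderboard reports this per category and averaged overall.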
Most importantly, we ensured the accuracy of human judgments by using advanced quality control techniques:
Annotator onboarding with rigorous qualification tests to select and certify experts and check their performance on evaluation tasks.
Overlap of 3 with Dawid-Skene aggregation of the results (each prompt was evaluated by 3 experts and aggregated to achieve a single verdict).
Monitoring individual accuracy by comparing each expert's results with the majority vote; those who fell below the accuracy threshold were removed from the evaluation project.
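The aggregation and accuracy-monitoring steps above can be sketched as follows. Note this is a simplified illustration: it uses plain majority vote in place of the Dawid-Skene model (which additionally weights each annotator by estimated reliability), and the function names and threshold value are assumptions for the example.

```python
from collections import Counter

def aggregate(votes):
    """Majority-vote verdict from one task's 3 expert judgments.

    Stand-in for Dawid-Skene aggregation, which also models each
    annotator's reliability when resolving disagreements.
    """
    verdict, _count = Counter(votes).most_common(1)[0]
    return verdict

def annotator_accuracy(annotator_votes, aggregated_verdicts):
    """Share of tasks where one expert agreed with the aggregated verdict."""
    matches = sum(v == a for v, a in zip(annotator_votes, aggregated_verdicts))
    return matches / len(aggregated_verdicts)

# Three experts judge two prompts: "model" = tested model preferred,
# "baseline" = reference model (Guanaco 13B) preferred.
tasks = [
    ["model", "model", "baseline"],
    ["baseline", "baseline", "baseline"],
]
verdicts = [aggregate(votes) for votes in tasks]
print(verdicts)  # → ['model', 'baseline']

# Expert 1 voted "model" on both tasks, so they agreed with the
# aggregated verdict on only the first one.
expert_1 = ["model", "model"]
print(annotator_accuracy(expert_1, verdicts))  # → 0.5
```

An accuracy threshold (a project-specific value, not stated in the article) would then gate which experts remain in the evaluation pool.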
Ready to compare LLMs?
Find your AI application's use case categories on our LLM leaderboard and see how the models stack up. It never hurts to check other open LLM leaderboards (Hugging Face Open LLM Leaderboard, LMSYS, or others) for the big picture before you pick a model and start experimenting.
If you're interested in comparing more LLMs using our experts, or you need reliable evaluation of your model, we have the tools and resources you need.
Reach out to our team to learn how Toloka can help you achieve the quality insights you're looking for.
Article written by:
Toloka Team
Updated:
Nov 1, 2023