Toloka's new LLM Leaderboard: Finding the best model for your business

by Toloka Team

Imagine you're building a chatbot to answer investment questions for banking customers. Before you can start fine-tuning a foundation model, you need to make a big decision: which Large Language Model will be best suited to drive your AI tool?

To create your shortlist, you'll probably compare a host of factors like model size, inference options, latency, and costs. But ultimately you want to know which model will meet your expectations and serve your customers responsibly. You certainly don't want your investment chatbot to mislead anyone with murky financial advice, or risk sounding unfriendly or annoying. How can you effectively compare model performance on relevant business cases?

Toloka's new LLM Leaderboard is an excellent starting point. Our LLM ranking zeroes in on what's essential for reliable and relevant model comparisons.

  • We compare the 5 most popular models for maximum efficiency: WizardLM 13B V1.2, LLaMA 2 70B Chat, Vicuna 33B V1.3, GPT-4, and GPT-3.5 Turbo
  • We use unique organic prompts to evaluate the models in practical categories: brainstorming, closed Q&A, open Q&A, text generation, and rewriting
  • We use human experts — the gold standard — to accurately assess and rate the model output

Curious about the rankings? Check out the Toloka LLM Leaderboard here:

LLM Leaderboard

Keep reading to learn how we compare models and why it's important.

The evaluation method

Our model evaluation process has two stages: prompt collection and human evaluation.

Stage 1: Prompt collection

We collected our own dataset of high-quality prompts to use for LLM evaluation.

Other evaluation methods use open-source prompts, but they are not reliable enough for accurate evaluation. Using open-source datasets can be restrictive for several reasons:

  1. Many open-source prompts are too generic and do not reflect the needs of a business looking to implement an LLM.
  2. The range of tasks the open-source prompts cover might be broad, but the distribution is skewed towards certain tasks that are not necessarily the most relevant for downstream applications.
  3. It is virtually impossible to guarantee that the dataset was not leaked and the open-source prompts were not included in the training data of the existing LLMs.

To mitigate these issues, we collected original prompts sent to ChatGPT. Some were submitted by Toloka employees and some we found on the internet, but all of them came from real conversations with ChatGPT.

In these examples, you can see how the models respond differently to the same prompts.

Prompt: What does 'burning the midnight oil' mean?

GPT-4: "Burning the midnight oil" is an idiomatic expression that means working late into the night. It originates from the past when people used oil lamps for light to work after the sun went down. So, if someone is "burning the midnight oil," they are up late working, often on a project or task that requires extra time.

WizardLM: "Burning the midnight oil" is an idiom that means working late into the night, often until midnight or later, to complete tasks or meet deadlines. It is often used to describe an intense period of focused, dedicated work, often to the point of exhaustion. The phrase suggests that the person is spending so much time and energy on their work that they are essentially "burning" through their hours of rest and sleep, much like a candle burns through its wax.

Prompt: What is 5x5?

GPT-4: 5x5 is 25.

WizardLM: 5x5 refers to a 5 by 5 grid or a 5-dimensional space. In mathematics, a 5x5 grid is a square grid with five rows and five columns. Each cell in the grid is identified by a pair of coordinates, (i, j), where i ranges from 1 to 5 for the row and j ranges from 1 to 5 for the column.

In a 5x5 grid, there are 25 cells in total. Each cell can be filled with a specific value or content, such as a number, a letter, or a symbol. For example, a 5x5 grid can be used to create a puzzle where you have to fill in the missing numbers or letters in a specific pattern.

In higher mathematics, a 5-dimensional space is a more abstract concept used to represent complex geometric structures. It's often used in fields like physics and computer science to model complex systems and relationships between different variables. In everyday life, a 5x5 grid can be used to represent a calendar with five columns for days of the week and five rows for different time slots in a day.

Prompt: What is an espresso tonic? What does it consist of?

GPT-4: An espresso tonic is a refreshing and unique coffee-based beverage. It consists of two main ingredients:

1. Espresso: A concentrated form of coffee served in small, strong shots.

2. Tonic Water: A carbonated soft drink that contains quinine, which gives it a bitter taste. It's often used as a mixer in alcoholic beverages.

To prepare an espresso tonic, you fill a glass with ice, pour in tonic water, and then slowly add a shot of espresso on top. The result is a vibrant, layered drink with the bitterness of the espresso beautifully balanced by the sweetness and effervescence of the tonic water. Some variations might include a slice of citrus or a dash of simple syrup for added flavor.

WizardLM: An espresso tonic is a refreshing and energizing drink that combines the bold flavors of espresso with the fizzy effervescence of tonic water. This popular beverage has gained popularity in recent years due to its unique blend of coffee and alcohol, offering a twist on the classic gin and tonic.

To make an espresso tonic, you will need the following ingredients:

1. Espresso: Start with a shot of freshly made espresso. You can use a machine or make it manually using an espresso maker. The quality of your espresso is crucial, as it will provide the base flavor for your drink.
2. Tonic water: Tonic water is a carbonated beverage containing quinine, which gives it a slightly bitter taste. It's usually mixed with spirits like gin to create a more complex flavor profile. In the case of an espresso tonic, the quinine will complement the richness of the espresso.
3. Optional: If you want to add a bit of sweetness and balance out the bitterness of the espresso, you can include a small amount of simple syrup or sugar. However, this is optional and depends on your personal taste preferences.
4. Optional: For a more adult version of the drink, you can add a shot of vodka, rum, or other preferred spirits to create an espresso tonic cocktail. This will give the drink a slightly boozy twist and add an additional layer of complexity to the flavors.
5. Optional garnish: To enhance the appearance and aroma of your espresso tonic, you can garnish it with a lemon wedge or a sprig of mint. This is also optional and depends on your preference.

To make the espresso tonic, simply combine the espresso and tonic water in a tall glass filled with ice. Stir or shake well to combine the ingredients. If you're adding a sweetener or spirits, you can do so before or after mixing the ingredients, depending on your preference. Finally, garnish with a lemon wedge or mint sprig, if desired.

The espresso tonic is a versatile drink that can be enjoyed at any time of the day, from an afternoon pick-me-up to a nightcap. It's a unique blend of coffee and alcohol that offers a refreshing and energizing twist on the classic gin and tonic.

These prompts are the key to accurate evaluation — we can be certain that the prompts represent real-world use cases, and they were not used in any LLM training sets. We store the dataset securely and reserve it solely for use in this particular evaluation.

After collecting the prompts, we manually classified them by category and got the following distribution:

[Figure: distribution of prompts by category]

We intentionally excluded prompts about coding. If you are interested in comparing coding abilities, you can refer to specific benchmarks such as HumanEval.

Stage 2: Human evaluation

Human evaluation of the model responses was conducted by Toloka's domain experts.

Our experts were given a prompt and responses to this prompt from two different models: the reference model (Guanaco 13B) and the model under evaluation. In a side-by-side comparison, experts selected the better output according to the harmlessness, truthfulness, and helpfulness principles.

In other words, each model was compared to the same baseline model, rather than comparing each model to every other competitor model. Then we calculated the percentage of prompts where humans preferred the tested model's output over the baseline model's output (this is called the model's win rate). The leaderboard shows results in each category, as well as the average score overall for each of the tested models.
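As a rough illustration of the win-rate calculation, here is a minimal Python sketch. The input format, the function name `win_rates`, and the sample verdicts are hypothetical. The sketch also assumes that every comparison has a clear winner (no ties) and that the overall score is the unweighted mean of the per-category win rates; the leaderboard's exact averaging may differ.

```python
from collections import defaultdict

def win_rates(verdicts):
    """Compute per-category and overall win rates against the baseline.

    verdicts: iterable of (category, winner) pairs, where winner is
    "model" (tested model preferred) or "baseline" (reference preferred).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for category, winner in verdicts:
        totals[category] += 1
        if winner == "model":
            wins[category] += 1
    per_category = {c: wins[c] / totals[c] for c in totals}
    # Assumption: overall score = unweighted mean over categories.
    overall = sum(per_category.values()) / len(per_category)
    return per_category, overall

# Hypothetical aggregated verdicts, one per prompt.
verdicts = [
    ("brainstorming", "model"),
    ("brainstorming", "baseline"),
    ("rewriting", "model"),
    ("rewriting", "model"),
]
per_category, overall = win_rates(verdicts)
print(per_category)  # {'brainstorming': 0.5, 'rewriting': 1.0}
print(overall)       # 0.75
```

Because every model is compared with the same baseline, win rates computed this way can be ranked against each other without running every pairwise comparison.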

Most importantly, we ensured the accuracy of human judgments by using advanced quality control techniques:

  • Annotator onboarding with rigorous qualification tests to certify experts and check their performance on evaluation tasks.
  • Overlap of 3 with Dawid-Skene aggregation of the results (each prompt was evaluated by 3 experts, and their verdicts were aggregated into a single verdict).
  • Monitoring individual accuracy by comparing each expert's results with the majority vote; those who fell below the accuracy threshold were removed from the evaluation project.

Ready to compare LLMs?

Find your AI application's use case categories on our leaderboard and see how the models stack up. It never hurts to check more leaderboards (Hugging Face Open LLM Leaderboard, LMSYS, or others) for the big picture before you pick a model and start experimenting.

If you're interested in comparing more LLMs using our experts, or you need reliable evaluation of your model, we have the tools you need.

Reach out to our team to learn how Toloka can help you achieve the quality insights you're looking for.
