Imagine you're building a chatbot to answer investment questions for banking customers. Before you can start fine-tuning a foundation model, you need to make a big decision: which Large Language Model will be best suited to drive your AI tool?
To create your shortlist, you'll probably compare a host of factors like model size, inference options, latency, and costs. But ultimately you want to know which model will meet your expectations and serve your customers responsibly. You certainly don't want your investment chatbot to mislead anyone with murky financial advice, or risk sounding unfriendly or annoying. How can you effectively compare model performance on relevant business cases?
Toloka's new LLM Leaderboard is an excellent starting point. Our LLM ranking zeroes in on what's essential for reliable and relevant model comparisons.
Curious about the rankings? Check out the Toloka LLM Leaderboard.
Keep reading to learn how we compare models and why it's important.
Our model evaluation process has two stages: prompt collection and human evaluation.
We collected our own dataset of high-quality prompts to use for LLM evaluation.
Other evaluation methods use open-source prompts, but they are not reliable enough for accurate evaluation. Using open-source datasets can be restrictive for several reasons:
To mitigate these issues, we collected original prompts from real conversations with ChatGPT. Some were submitted by Toloka employees, and others we found on the internet.
In these examples, you can see how the models respond differently to the same prompts.
GPT-4: "Burning the midnight oil" is an idiomatic expression that means working late into the night. It originates from the past when people used oil lamps for light to work after the sun went down. So, if someone is "burning the midnight oil," they are up late working, often on a project or task that requires extra time.
WizardLM: "Burning the midnight oil" is an idiom that means working late into the night, often until midnight or later, to complete tasks or meet deadlines. It is often used to describe an intense period of focused, dedicated work, often to the point of exhaustion. The phrase suggests that the person is spending so much time and energy on their work that they are essentially "burning" through their hours of rest and sleep, much like a candle burns through its wax.
GPT-4: 5x5 is 25.
WizardLM: 5x5 refers to a 5 by 5 grid or a 5-dimensional space. In mathematics, a 5x5 grid is a square grid with five rows and five columns. Each cell in the grid is identified by a pair of coordinates, (i, j), where i ranges from 1 to 5 for the row and j ranges from 1 to 5 for the column.
In a 5x5 grid, there are 25 cells in total. Each cell can be filled with a specific value or content, such as a number, a letter, or a symbol. For example, a 5x5 grid can be used to create a puzzle where you have to fill in the missing numbers or letters in a specific pattern.
In higher mathematics, a 5-dimensional space is a more abstract concept used to represent complex geometric structures. It's often used in fields like physics and computer science to model complex systems and relationships between different variables. In everyday life, a 5x5 grid can be used to represent a calendar with five columns for days of the week and five rows for different time slots in a day.
GPT-4: An espresso tonic is a refreshing and unique coffee-based beverage. It consists of two main ingredients:
1. Espresso: A concentrated form of coffee served in small, strong shots.
2. Tonic Water: A carbonated soft drink that contains quinine, which gives it a bitter taste. It's often used as a mixer in alcoholic beverages.
To prepare an espresso tonic, you fill a glass with ice, pour in tonic water, and then slowly add a shot of espresso on top. The result is a vibrant, layered drink with the bitterness of the espresso beautifully balanced by the sweetness and effervescence of the tonic water. Some variations might include a slice of citrus or a dash of simple syrup for added flavor. An espresso tonic is a refreshing and energizing drink that combines the bold flavors of espresso with the fizzy effervescence of tonic water. This popular beverage has gained popularity in recent years due to its unique blend of coffee and alcohol, offering a twist on the classic gin and tonic.
WizardLM: To make an espresso tonic, you will need the following ingredients:
1. Espresso: Start with a shot of freshly made espresso. You can use a machine or make it manually using an espresso maker. The quality of your espresso is crucial, as it will provide the base flavor for your drink.
2. Tonic water: Tonic water is a carbonated beverage containing quinine, which gives it a slightly bitter taste. It's usually mixed with spirits like gin to create a more complex flavor profile. In the case of an espresso tonic, the quinine will complement the richness of the espresso.
3. Optional: If you want to add a bit of sweetness and balance out the bitterness of the espresso, you can include a small amount of simple syrup or sugar. However, this is optional and depends on your personal taste preferences.
4. Optional: For a more adult version of the drink, you can add a shot of vodka, rum, or other preferred spirits to create an espresso tonic cocktail. This will give the drink a slightly boozy twist and add an additional layer of complexity to the flavors.
5. Optional garnish: To enhance the appearance and aroma of your espresso tonic, you can garnish it with a lemon wedge or a sprig of mint. This is also optional and depends on your preference.
To make the espresso tonic, simply combine the espresso and tonic water in a tall glass filled with ice. Stir or shake well to combine the ingredients. If you're adding a sweetener or spirits, you can do so before or after mixing the ingredients, depending on your preference. Finally, garnish with a lemon wedge or mint sprig, if desired.
The espresso tonic is a versatile drink that can be enjoyed at any time of the day, from an afternoon pick-me-up to a nightcap. It's a unique blend of coffee and alcohol that offers a refreshing and energizing twist on the classic gin and tonic.
These prompts are the key to accurate evaluation: we can be certain that they represent real-world use cases and that they were not used in any LLM training sets. We store the dataset securely and reserve it solely for use in this particular evaluation.
After collecting the prompts, we manually classified them by category and got the following distribution:
We intentionally excluded prompts about coding. If you are interested in comparing coding abilities, you can refer to specific benchmarks such as HumanEval.
Human evaluation of the model responses was conducted by Toloka's domain experts.
Our experts were given a prompt and two responses to it: one from the reference model (Guanaco 13B) and one from the model under evaluation. In a side-by-side comparison, experts selected the better output according to the principles of harmlessness, truthfulness, and helpfulness.
In other words, each model was compared to the same baseline model, rather than comparing each model to every other competitor model. Then we calculated the percentage of prompts where humans preferred the tested model's output over the baseline model's output (this is called the model's win rate). The leaderboard shows results in each category, as well as the average score overall for each of the tested models.
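To make the win rate concrete, here is a minimal Python sketch of the calculation described above. It is an illustration, not the leaderboard's actual code: the function name `win_rates`, the category names, and the sample judgments are all hypothetical, and it assumes one binary judgment per prompt (ties are not modeled) and that the overall score is the plain average of the per-category win rates.

```python
from collections import defaultdict


def win_rates(judgments):
    """Compute per-category win rates (in percent) and their average.

    judgments: iterable of (category, tested_preferred) pairs, one per prompt,
    where tested_preferred is True if the expert preferred the tested model's
    output over the baseline model's output.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for category, tested_preferred in judgments:
        totals[category] += 1
        if tested_preferred:
            wins[category] += 1

    if not totals:
        raise ValueError("at least one judgment is required")

    # Win rate per category: share of prompts where the tested model won.
    per_category = {c: 100.0 * wins[c] / totals[c] for c in totals}
    # Assumption: the overall score is the unweighted mean across categories.
    overall = sum(per_category.values()) / len(per_category)
    return per_category, overall


# Hypothetical judgments for one tested model against the baseline.
sample = [
    ("brainstorming", True),
    ("brainstorming", False),
    ("open_qa", True),
    ("open_qa", True),
    ("open_qa", False),
    ("open_qa", True),
]

per_category, overall = win_rates(sample)
print(per_category)  # {'brainstorming': 50.0, 'open_qa': 75.0}
print(overall)       # 62.5
```

Averaging across categories gives each category equal weight regardless of how many prompts it contains. Averaging over all prompts instead would weight larger categories more heavily, so the two choices can produce different overall scores.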
Most importantly, we ensured the accuracy of human judgments by using advanced quality control techniques:
Find your AI application's use case categories on our leaderboard and see how the models stack up. It never hurts to check more leaderboards (Hugging Face Open LLM Leaderboard, LMSYS, or others) for the big picture before you pick a model and start experimenting.
If you're interested in comparing more LLMs using our experts, or you need reliable evaluation of your model, we have the tools you need.
Reach out to our team to learn how Toloka can help you achieve the quality insights you're looking for.