Spoke.ai: Summarization as rocket fuel
Summary
Client: Spoke.ai
Task: Evaluate the quality of suggested action items generated by the models
Goal: Compare 2 production pipelines to achieve the best summarization quality
Metrics evaluated: Truthfulness, Helpfulness, Language
Spoke.ai develops an AI-powered inbox that automatically summarizes priority content in productivity tools like Slack, helping teams stay on top of their work and build products faster. The tool can give users a detailed summary of a conversation thread with full descriptions of issues raised and who participated, as well as a list of action points that need to be addressed after a discussion.
The Spoke.ai team partnered with Toloka to examine the quality of GenAI performance on action point summarization. The goal was to compare two LLMs and accurately measure quality on this task. Toloka set up a deep evaluation pipeline to discover which model is better at summarizing action points — an older pipeline based on GPT-3.5, or a newer pipeline based on GPT-4.
Working with Toloka confirmed that we could deliver significantly increased value to our users with the new GPT-4 based pipeline and helped identify further opportunities to improve. More accurate suggested actions help our users power through their inbox.
— Gráinne McKnight, Founding Data Science Lead, Spoke.ai
The challenge: Defining quality metrics for model comparison
Spoke.ai needed to evaluate quality and gain insights into performance for two purposes:
Identify problematic aspects of the model's performance before deploying it in production for real users.
Optimize existing pipelines with a balanced tradeoff between quality and cost.
To measure quality accurately, we aligned expectations for model output. The first step was to define what action points should be included in the summary and shared with users. We settled on these criteria:
The action must be explicitly stated during the conversation.
Participants must agree on it. We chose a relaxed definition of agreement, meaning if someone proposed an action and no one refuted it, it is accepted as an action point.
The action must be directly work-related.
Completed items don't count as action points.
The next step was to check whether the model's description of the action was accurate, meaning the assignee was identified correctly and the description was free of factual errors and distortions.
It's equally important to correctly identify when the conversation didn't have any action points. In fact, the majority of discussions may conclude without any unfinished actions, and the summary should reflect this. Model hallucinations (fabricated action points) can have a negative impact on the user’s perception of product quality.
The solution: Aggregated metrics for truthfulness and helpfulness
Our evaluation plan focused on two key metrics: truthfulness and helpfulness.
Truthfulness measures how well the model's summary matches the source conversation. It's calculated as an aggregated score of 4 low-level metrics:
Does the action proposed by the model meet the criteria above for including it in the summary?
Does it correctly identify the assignee?
Does it reflect the action description from the source text?
Does it identify conversations that don't have any action points?
These metrics are objective enough to measure pointwise for model comparison.
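To make the per-action labeling concrete, here is a minimal Python sketch of how these four pointwise judgments might be recorded and combined into a per-action verdict. The TruthfulnessLabels schema and the action_is_correct helper are illustrative assumptions, not Toloka's actual annotation tooling.

```python
from dataclasses import dataclass

@dataclass
class TruthfulnessLabels:
    """Annotator judgments for one suggested action (hypothetical schema)."""
    meets_criteria: bool        # explicitly stated, agreed on, work-related, not already completed
    assignee_correct: bool      # the right person is identified as the assignee
    description_accurate: bool  # free of factual errors and distortions
    is_no_actions_claim: bool   # the item states that the thread has no action points
    no_actions_is_true: bool    # ...and the thread really has none

def action_is_correct(labels: TruthfulnessLabels) -> bool:
    """Score an action as correct if it passes every check, or if it is an
    accurate "no action points" statement."""
    if labels.is_no_actions_claim:
        return labels.no_actions_is_true
    return (labels.meets_criteria
            and labels.assignee_correct
            and labels.description_accurate)
```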
Helpfulness measures how useful the model's summary is to the user. Pointwise evaluation isn't practical for a subjective metric like helpfulness, so we broke evaluation down into 4 low-level metrics with some pairwise comparisons:
Action coverage. Which model covers more actions found in the source text?
Details. Which model includes more context in the action description?
Briefness. Which model output is more concise (avoids unnecessary words and phrases in the description)?
Repetitiveness. Does the summary repeat the same actions? (This is the only pointwise metric in this section.)
Each low-level metric is labeled separately by our expert annotators, and then we compute an aggregated score for each high-level metric.
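As an illustration, the helpfulness judgments for one pair of summaries could be captured in a record like the sketch below. The Verdict and HelpfulnessLabels names are assumptions for the sake of the example, not part of Spoke.ai's or Toloka's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    MODEL_A = "A"   # e.g. the GPT-3.5 pipeline
    MODEL_B = "B"   # e.g. the GPT-4 pipeline
    TIE = "tie"

@dataclass
class HelpfulnessLabels:
    """Annotator judgments for one pair of summaries of the same thread (hypothetical schema)."""
    action_coverage: Verdict  # which summary covers more of the actions in the source text
    details: Verdict          # which summary gives more useful context per action
    briefness: Verdict        # which summary is more concise
    repetitive_a: bool        # pointwise: does summary A repeat the same actions?
    repetitive_b: bool        # pointwise: does summary B repeat the same actions?
```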
The details: How the evaluation pipeline works
Toloka's expert annotators use color highlighting to analyze the summary texts and then answer questions to rate them. Each rating task focuses on a single action. If the model output has multiple action points, they are split into separate tasks and assessed separately. Here is an example of the annotator's task for labeling truthfulness:
And here is a task for rating helpfulness:
Score aggregation:
Truthfulness
Each action is assessed individually. We score the action as correct if it meets the criteria (work-related, agreed, and unfinished), correctly identifies the assignee, and has an accurate description. It's also scored as correct if it accurately states that the conversation doesn't have any action points. If any of these requirements are not met, we score the action as incorrect. All the actions in the summary must be correct. Otherwise, the entire summary is considered incorrect.
Helpfulness
We choose the winning model in this category based on which model has more wins across 3 metrics (Action coverage, Details, Briefness) and declare a tie if the scores are equal.
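Continuing the hypothetical schema from the sketches above, the aggregation rules could be expressed roughly as follows: a summary is truthful only if every action in it is correct, and the helpfulness winner for a summary pair is decided by majority vote over the three pairwise metrics, with equal scores counting as a tie.

```python
from collections import Counter
from typing import Iterable

def summary_is_truthful(action_labels: Iterable[TruthfulnessLabels]) -> bool:
    """A summary is truthful only if every action in it is scored as correct."""
    return all(action_is_correct(labels) for labels in action_labels)

def helpfulness_winner(labels: HelpfulnessLabels) -> Verdict:
    """Decide the winner for one summary pair by majority vote over the three
    pairwise metrics (action coverage, details, briefness); equal scores are a tie."""
    votes = Counter((labels.action_coverage, labels.details, labels.briefness))
    a_wins, b_wins = votes[Verdict.MODEL_A], votes[Verdict.MODEL_B]
    if a_wins > b_wins:
        return Verdict.MODEL_A
    if b_wins > a_wins:
        return Verdict.MODEL_B
    return Verdict.TIE
```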
The results: A clear winner in both categories
We compared Spoke.ai's previous pipeline, based on GPT-3.5, to the new pipeline based on GPT-4. The new pipeline performed significantly better on both truthfulness and helpfulness. However, the metrics revealed some quality issues with GPT-4 that demand attention.
Truthfulness
Win rates:
GPT-3.5: 10%
GPT-4: 44%
Tie: 46%
Insights into truthfulness:
The GPT-3.5 pipeline has a strong tendency to make up actions or mention actions that were already completed.
The GPT-4 pipeline, by contrast, is biased toward returning “No actions” even when action points are present in the conversation.
Both models sometimes fail to correctly identify the assignee, but GPT-4 makes this mistake far less often.
GPT-4 has a higher relative share of errors in the action description. This may be because GPT-4 tends to generate more verbose outputs and sometimes misattributes small details, which are harder to get right than the general idea of an action.
Error distribution:
Helpfulness
Win rates:
GPT-3.5: 12%
GPT-4: 38%
Tie: 50%
Helpfulness broken down by metrics:
Insights into helpfulness:
The GPT-4 pipeline gives more verbose, detailed outputs, almost always providing users with enough context to understand the action. In this respect it is much better than the GPT-3.5 pipeline, whose outputs often miss important details.
At the same time, the outputs generated by the new pipeline may contain unnecessary information that could easily be dropped without losing context.
The new pipeline is also better at detecting actions in the conversation.
Since the new GPT-4 pipeline offers significantly better action detection and fewer false action extractions, we can say it does a better job of understanding action points and meeting user expectations. From here, the team can use prompt engineering or fine-tuning to fix errors in the model output — putting Spoke.ai on a faster path to stellar results.
The outcome: A confident choice for Spoke.ai
Our approach to task decomposition and deep evaluation focuses on specific business goals for the language models. We evaluated Spoke.ai's production pipelines on what matters most for the quality of action point summaries and found a clear advantage to using their new GPT-4-based pipeline. Accurate metrics allow the Spoke.ai team to make informed decisions by comparing model quality within the context of their product’s goals, and justify the cost of switching to a newer model.
Toloka's deep evaluation also brought specific quality issues to light, giving the team a solid understanding of their product's quality on action generation and a list of areas to work on next. With this new set of quality metrics, the Spoke.ai team can track improvements and reach new levels of efficiency together with their users.
Article written by:
Toloka Team
Updated:
Dec 6, 2023