Toloka helps ServiceNow scale evaluation throughput from hundreds of tasks per week to thousands per day
Human evaluation at scale for ServiceNow Research: how Toloka processes up to 3000 multi-turn dialogs per day
“One of the most remarkable aspects of working with Toloka is the invaluable support provided by their team during project setup, which we greatly appreciate. Their efficiency and organization in communicating instructions to experts significantly contribute to the overall success of the project” – Dzmitry Bahdanau, Principal Research Scientist, Lead, ServiceNow
Outcome
Fast scale-up of evaluation throughput
Data type
Human eval
The challenge
ServiceNow offers comprehensive AI solutions for automating business processes. One way they are embracing GenAI in their products is by building AI-powered conversational chatbots.
With the goal of offering an exceptional conversational experience to their customers, the ServiceNow Research team developed a research concept to improve the chatbot's conversational abilities. The project needed vast amounts of high-quality data and precise human evaluation.
The outcome
The ServiceNow Research team started collecting data for their project, but they needed higher quality data and a lot more of it.
Collaborating with Toloka took the team’s throughput from 400 evaluation tasks per week to a peak of 3000 tasks per day — and improved data quality metrics at the same time, making an impressive contribution to the project’s success. All of the data was used for the research project, not for training models in production.
About ServiceNow
ServiceNow helps customers transform business processes with an AI-powered platform for automation across the enterprise: customer service, IT, employee productivity and HR services, cybersecurity, risk management, enterprise resource planning, finance, and more. Their industry solutions serve the education, energy, government, healthcare, manufacturing, media, nonprofit, and retail sectors.
One of the company’s GenAI offerings is a conversational chatbot for IT service management ticketing. The goal is to enhance the customer experience and advance toward the next level of conversational interactions.
Evaluating chatbot responses to improve performance
The chatbot assists users in formulating their requests to speed up processing and make sure tickets are submitted correctly. The bot’s purpose is to gather information from the user, fill out a virtual form to create a service ticket, and send it to the support queue.
Like any generative AI model, the virtual agent needs to be trained and tested on large volumes of data to provide the most relevant responses. The first step was to assess the quality of the chatbot's responses with human evaluation.
When the ServiceNow Research team began collecting data to improve the chatbot, limited throughput started holding them back. They needed to move faster, and they needed high-quality data to de-risk their research concept.
Why ServiceNow chose Toloka as a data partner
To address these challenges, the ServiceNow Research team collaborated with Toloka, leveraging our expertise in evaluation. Following our earlier work together on the BigCode project with HuggingFace, they were confident that the Toloka team could offer scalability, large-volume support, and agile project setup.
We implemented customized pipelines for evaluating the agent’s responses, ensuring that they meet criteria for responsiveness, transparency, accuracy, groundedness, and helpfulness.
ServiceNow benefited from Toloka's extensive experience in managing human evaluations and running quality control processes. After the initial data was received, the two teams worked together to adapt the evaluation pipeline for better efficiency.
Evaluating the agent's responses
The conversational agent developed by ServiceNow helps users resolve their issues by understanding their intent and offering guidance. To improve the agent’s responses, the ServiceNow team performs in-depth testing using a second LLM, the user model. The purpose of this LLM is to mimic potential user behavior so the agent model can encounter different user scenarios during training.
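ServiceNow's exact setup is not public, but the interplay between the agent model and the user model can be pictured with a minimal sketch. Everything below, including the `complete()` method, the `render()` helper, and the stop condition, is a hypothetical stand-in rather than ServiceNow's actual code:

```python
def render(dialog: list[dict]) -> str:
    """Flatten the dialog so far into a plain-text prompt."""
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in dialog)

def simulate_dialog(agent_llm, user_llm, scenario: str, max_turns: int = 10) -> list[dict]:
    """Run one simulated conversation: the user model plays a customer with
    the given issue, while the agent model gathers details for a ticket."""
    dialog: list[dict] = []
    user_msg = user_llm.complete(f"You have this IT issue: {scenario}. Start the conversation.")
    for _ in range(max_turns):
        dialog.append({"role": "user", "content": user_msg})
        agent_msg = agent_llm.complete(render(dialog))  # agent sees the full context
        dialog.append({"role": "agent", "content": agent_msg})
        if "ticket created" in agent_msg.lower():       # naive end-of-dialog check
            break
        user_msg = user_llm.complete(render(dialog))    # user model reacts in character
    return dialog
```

Running the loop across many scenarios exposes the agent to varied user behavior, which is what makes the evaluated dialogs diverse.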
In this project, Toloka’s expert annotators evaluated the agent’s responses. The dialogs begin with a user request, continue with follow-up questions and answers, and end with generating a support ticket. Each response is generated and reviewed separately within the context of the dialog.
Expert annotators read interaction dialogs that end with an agent message. Each dialog has three parts: user messages, agent messages, and the agent’s "internal thoughts". The agent’s thought process is not visible to users, but it’s included in the evaluation dialogs to help developers and annotators detect errors in reasoning.
Annotators evaluate the agent’s last message in the context of the entire conversation, as shown in the screenshots below.
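To make that structure concrete, here is one plausible way to represent such an evaluation item in code; the field names are illustrative and not ServiceNow's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str           # "user" or "agent"
    content: str        # the message visible in the conversation
    thoughts: str = ""  # agent-only "internal thoughts", hidden from the user

@dataclass
class EvaluationItem:
    dialog: list[Turn] = field(default_factory=list)

    def last_agent_message(self) -> Turn:
        """The message the annotator judges, read in the context of the whole dialog."""
        return next(t for t in reversed(self.dialog) if t.role == "agent")
```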
The evaluation task covers five binary criteria: responsiveness, transparency, accuracy, groundedness, and helpfulness. These criteria are based on the following expectations for the agent (a sketch of the resulting annotation record follows the list):
Be Responsive: The agent should deliver a response relevant to the user’s last message.
Be Transparent: If the agent makes a remark in its “internal thoughts,” it must notify the user.
Be Accurate: When users provide parameter values, the agent should remember them accurately.
Be Grounded: The agent should adhere to the context of the conversation and should not lie (hallucination check). Any suggestions should be based on the conversation or documentation rather than imagined scenarios. The agent should not show personality or sympathize with users.
Be Helpful: The agent should guide the dialogue towards completion, gently nudging users to provide necessary information or return to the original request if they change the topic. After gathering all relevant data, the agent should summarize the answers and ask the user to review them before creating a ticket.
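Since each criterion is a binary verdict, a single annotation can be captured in a simple record. The schema below is illustrative, and the aggregation rule at the end is our own assumption rather than a documented part of the project:

```python
from dataclasses import dataclass

@dataclass
class CriteriaVerdict:
    """One annotator's verdict on a single agent message (illustrative schema)."""
    responsive: bool   # relevant to the user's last message
    transparent: bool  # "internal thoughts" remarks are surfaced to the user
    accurate: bool     # user-provided parameter values remembered correctly
    grounded: bool     # no hallucinations; stays within conversation and docs
    helpful: bool      # nudges the dialog toward ticket completion

    def passes(self) -> bool:
        """One possible aggregation: a message passes only if all five hold."""
        return all([self.responsive, self.transparent,
                    self.accurate, self.grounded, self.helpful])
```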
Ensuring the quality of evaluation data
Around 10% of evaluation tasks gathered from expert annotators are audited for quality. Toloka auditors check results and decide if the quality is acceptable. Even if the results are good, auditors provide direct feedback to address minor issues and help annotators refine their skills.
Pipeline for evaluation projects
The setup that we designed for the ServiceNow research team freed up resources they otherwise would have spent on internal audits. Because the audit is part of our process, we can deliver metrics alongside the evaluation data to establish data quality. For this project, we calculated precision and recall for the "No" class (when annotators decided that the model did not meet requirements) due to its rarity and difficulty in detection. Current metrics meet the 85% threshold.
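As an illustration of how these metrics can be computed, precision and recall for the "No" class compare annotator verdicts against auditor verdicts, treating the auditor as ground truth (the labels, data, and names below are ours):

```python
def no_class_metrics(annotator_labels, auditor_labels):
    """Precision and recall for the rare "No" class, treating the
    auditor's verdict as ground truth."""
    pairs = list(zip(annotator_labels, auditor_labels))
    tp = sum(a == "No" and g == "No" for a, g in pairs)   # both flagged a failure
    fp = sum(a == "No" and g == "Yes" for a, g in pairs)  # annotator over-flagged
    fn = sum(a == "Yes" and g == "No" for a, g in pairs)  # annotator missed a failure
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data; in the project, both metrics must meet the 85% threshold.
precision, recall = no_class_metrics(
    ["No", "Yes", "No", "Yes", "No"],   # annotator verdicts
    ["No", "Yes", "Yes", "Yes", "No"],  # auditor verdicts
)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.67, recall=1.00
```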
Impact: from 400 tasks per week to 3,000 tasks per day, and efficient collaboration
Toloka’s human evaluation pipelines increased the throughput of task processing from 400 tasks per week to 3000 tasks per day, providing a continual data supply for ServiceNow to fuel their research.
Initially attracted by Toloka’s approach to quality, the ServiceNow Research team also appreciated our efficient task setup and commitment to improvement. By relying on Toloka to manage expert annotators, implement quality control, monitor metrics, and dynamically adapt pipelines, they kept their focus on achieving the best possible results for chatbot performance.
Summary
When building LLM-based applications, it's important to realize that even though the model may perform well most of the time, it may hallucinate when facing edge cases or atypical user behavior. Identifying these malfunctions can be challenging without a partner proficient in data evaluation. Toloka’s expertise enhanced the ServiceNow data evaluation project, offering improved results through customized pipelines and expert management.
Is it time to scale your model evaluation process? Let Toloka build a customized pipeline similar to those used by ServiceNow and other market leaders. Whether you're aiming to improve efficiency, increase accuracy, or enhance any other evaluation metrics, our team is ready to assist. Contact us to discuss your requirements and explore the next steps towards optimizing your model's performance.
For more details, refer to the article TapeAgents: A Holistic Framework for Agent Development and Optimization, and explore the framework on GitHub.
Article written by:
Toloka Team
Updated:
Oct 11, 2024