The evolution of Toloka: From data labeling to data architecture
The rapid advancement of large language models (LLMs) has led companies to increasingly turn to synthetic data: AI-generated data that mimics real information. This technology marks the beginning of a new era, one where the benefits of AI risk being overshadowed by the potential for harm. We believe that safe and responsible AI products require high-quality data grounded in human insight, with full transparency in data sourcing. Synthetic data is a powerful tool and an important part of our own offering, but it still needs human insight for proper curation; it is one tool in our toolbox, not the only one.
Toloka’s rich history puts our company in a unique position to embrace AI achievements and help shape AI development to better serve humanity.
Bridging Community and Technology
There has always been a gap between AI technology and the data needed to fuel it, and Toloka was born out of a need to fill this gap with large-scale human data labeling. The project grew into one of the largest crowdsourcing platforms on the planet, with people all over the globe enriching data with a wealth of perspectives.
The true strength of AI lies in its ability to reflect real-world experiences. Without human input, technology is useless to us; but without technology, human effort is not scalable. This understanding has guided our evolution over the past 10 years, as we have continually sought new ways to combine the power of human knowledge and community with advanced technologies that harness it effectively.
The Role of Experts
As large language models took the world by storm, the data labeling landscape changed dramatically. At first, human intuition and general knowledge were sufficient for training models. Now that foundation models perform well on general skills like answering basic questions, the game has changed. LLMs need refinement to enhance performance in fields like coding, medicine, mathematics, automotive engineering, and finance, and the list goes on. There is high demand for specialized, dedicated datasets to solve complex tasks in specific domains.
Toloka’s focus has evolved from labeling existing data to crafting custom datasets from scratch, including writing complex dialogs between AI agents and humans on niche topics. Instead of relying on aggregate knowledge from the crowd, we curate unique contributions from highly educated professionals such as physicists, doctors, and software developers, who craft specialized data samples to train and improve LLMs.
We recently established the Mindrift platform to bring together domain experts from around the world. The intention is to build on our experience scaling operational processes while grounding our efforts in cutting-edge research and our own insights as AI practitioners. The platform is designed to foster community in a deeper way: experts on Mindrift work as teams toward shared goals, in stark contrast to the crowdsourcing model of anonymous contributions. Their knowledge is funneled into data production pipelines that ensure high-quality results for supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and model evaluation.
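To make the pipeline output more concrete, here is a minimal sketch of how expert contributions might be packaged as training records. The structures and field names are illustrative assumptions, not Mindrift's actual schema or Toloka's pipeline code.

```python
# Illustrative sketch only: the record structures below are assumptions,
# not Mindrift's actual schema.
from dataclasses import dataclass

@dataclass
class SFTRecord:
    """One supervised fine-tuning example written by a domain expert."""
    prompt: str     # task or question posed to the model
    response: str   # expert-written reference answer
    domain: str     # e.g. "medicine", "software engineering"

@dataclass
class PreferencePair:
    """One RLHF-style comparison labeled by a domain expert."""
    prompt: str
    chosen: str     # response the expert ranked higher
    rejected: str   # response the expert ranked lower

# Example usage with made-up content:
record = SFTRecord(
    prompt="Explain the difference between a thread and a process.",
    response="A process has its own address space, while threads share one...",
    domain="software engineering",
)
```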
A New Stage: Data Architecture
Today, we stand at a pivotal moment where the role of a data architect is crucial. It's not enough to simply connect Tolokers with requesters; we must design efficient pipelines that augment human insight with synthetic data generated by LLMs. As architects, we analyze a model's needs, plan how to balance data types, and run experiments to ensure the final dataset will truly make the model better.
LLMs offer many opportunities to optimize data production beyond generating synthetic data. We incorporate models into our pipelines as co-pilots and automated checks that reduce routine work for our experts and improve datasets overall. We've discovered a new level of synergy between human input and AI optimization.
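As a rough illustration of that synergy, the sketch below blends expert-written samples with LLM-generated variants and gates everything through an automated check before it enters the dataset. The function names, the synthetic_ratio parameter, and the check heuristics are assumptions for the sake of example, not Toloka's production logic.

```python
# Minimal sketch of a hybrid data pipeline (assumed names and heuristics,
# not Toloka's production code): expert samples plus synthetic variants,
# both gated by an automated quality check.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    prompt: str
    response: str
    source: str  # "expert" or "synthetic"

def auto_check(sample: Sample) -> bool:
    """Placeholder automated check; a real pipeline would run rule-based
    and model-based validators (format, length, consistency, etc.)."""
    text = sample.response.strip()
    return 0 < len(text) < 4000

def build_dataset(
    expert_samples: list[Sample],
    generate_variant: Callable[[Sample], Sample],  # e.g. a call to an LLM
    synthetic_ratio: float = 0.5,
) -> list[Sample]:
    """Keep expert samples that pass the check, then add synthetic variants
    seeded from them until the requested ratio is reached."""
    dataset = [s for s in expert_samples if auto_check(s)]
    n_synthetic = int(len(dataset) * synthetic_ratio)
    for seed in dataset[:n_synthetic]:
        candidate = generate_variant(seed)
        if auto_check(candidate):
            dataset.append(candidate)
    return dataset
```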
Toloka is building the foundation for future growth by refining our platform, expanding our community of professional experts, deepening our knowledge of the data we work with, and advancing our product to democratize access to this data for the AI community.
Article written by:
Olga Megorskaya
Updated:
Aug 22, 2024