Throwing a Cat Among the Pigeons? Augmenting Human Computation With Large Language Models

Ujwal Gadiraju

September 4, 2023

Insights

I have always been fascinated by etymology. More often than not, there is an intriguing story behind how words and phrases have acquired the meanings we are so familiar with, morphing through the ages and mingling with changing times. The Mechanical Turk was a chess-playing humanoid machine built in the 18th century by the Hungarian author and inventor Wolfgang von Kempelen. The story goes that the Mechanical Turk toured Europe and humbled noteworthy names like Napoleon Bonaparte and Benjamin Franklin in fabled battles of chess. Only later was the secret sauce unveiled: a real human chess whiz hidden from view in the cabinet beneath the chessboard, from where they controlled the moves made by the humanoid.

This story was the inspiration behind the naming of the Amazon Mechanical Turk crowdsourcing platform launched in 2005. The platform was designed to solve tasks that could not be solved by contemporary alternatives and required human input or intelligence. It was in this context that the notion of “artificial artificial intelligence” took shape and form, wherein humans serve as the source of intelligence when it is beyond the capabilities of machines. We have come a long way from there, to the cusp of a brand new notion of “artificial artificial artificial intelligence”. Yes, you read that right. Three artificials. Before you try to crack this walnut with your forehead, let’s take a quick tour down some memory lanes.

The Early Days of Crowdsourcing

In his book “The Wisdom of Crowds,” published in 2004, James Surowiecki explored and synthesized the attributes required to form a wise crowd: one that can often make decisions better than any single individual in the crowd. He identified diversity in opinion, independence in judgment, and decentralized knowledge as vital attributes to that end. In 2006, Jeff Howe coined the term crowdsourcing as a portmanteau of ‘crowds’ and ‘outsourcing’ in an article he wrote for Wired magazine on “The Rise of Crowdsourcing.” He discussed how businesses had begun tapping into the collective capabilities of distributed online communities through open calls to accomplish certain tasks.

Amazon Mechanical Turk thrived after it was first launched, and within years, hundreds of thousands of people around the globe found an opportunity to earn their livelihoods by completing tasks on the platform. This sparked growth in crowdsourcing platforms across the world, forging and cementing a new economy of online microtask crowd work. Researchers and practitioners began to rely on crowdsourcing platforms to accomplish various tasks and demonstrated that even highly complex tasks could be decomposed and crowdsourced. Systems and tools were proposed to support crowd workers in completing tasks effectively. Some prominent examples of contemporary crowdsourcing platforms include Toloka AI (“a data-centric environment to support fast and scalable AI development with the help of human insight”) and Prolific (a platform to “conduct research or train the next generation of AI”).

In 2009, the release of ImageNet spurred the entire field of machine learning. With over 3.2 million images in 12 subtrees with over five thousand synsets, it was a monumental data collection effort using crowdsourcing via Amazon Mechanical Turk [1]. This provided an unprecedented opportunity for progress in several computer vision tasks, including object recognition and image classification.

Let us not forget that this progress came with a great set of trials and tribulations. Tremors of the dangers concomitant with relying on human-generated data, prone to cognitive and systematic biases, were felt by many. In 2013, a group of well-known researchers in the crowdsourcing community wrote a paper called “The Future of Crowd Work,” in which they reflected on the status of the paradigm and the series of challenges that needed immediate addressing [2]. Many of these challenges remain unsolved a good 10 years later, despite a significant amount of progress. There have been well-documented problems pertaining to the quality of data collected (e.g., propagation of biases), power asymmetry on platforms, abysmal hourly wages, unfair work rejections, invisible labor, unhealthy work environments, and the list goes on. Despite the frailties of what some consider a fractured paradigm of work, remarkable results punctuate the historical timeline, and the power of crowdsourcing has unmistakably contributed to a rate of technological progress that few would have seen coming.

The Intriguing Age of Generative AI

Much of the mainstream media around the world today is lost in sweeping narratives around generative AI and what the democratization of large language models can mean. Many more lives will continue to be touched by AI in expected and unexpected ways. And it is the laborious work of humans behind the scenes that has been fuelling this AI revolution in the first place. If we were to “scrutinize the shadows of AI, we would discover the humans powering it,” as unforgettably put by Mary Gray and Sid Suri in Ghost Work [3].

Exaggerated forecasts and clickbait headlines have likened humans in this age to anxious pigeons and LLMs to bold cats, with the metaphorical cat disrupting the flock and sending the pigeons scattering. But what does the onset of this new age of generative AI models truly mean for human input? Has the need for human input in shaping future technologies been largely wiped out? In the remainder of this article, I will argue that the answer is a resounding no and that the main shift we should expect is in the nature of the human input that will continue to be needed.

I recently co-authored a workshop paper exploring how human computation workflows can embrace the emergence of generative AI models [4]. This work was presented at the Generative AI Workshop at the premier HCI conference, ACM CHI 2023, held in Hamburg earlier this year. We highlighted the potential role that large language models (LLMs) can play in augmenting existing crowdsourcing workflows and discussed how such workflows can be empirically evaluated.

A Primer on Crowdsourcing Workflows

Crowdsourcing workflows are distinct patterns that manage how large-scale tasks are decomposed into smaller tasks to be completed by crowd workers. The crowd-powered word processor, Soylent, applies the Find-Fix-Verify workflow to produce high-quality text by separating tasks into stages of generating and reviewing text. This enabled “writers to call on Mechanical Turk workers to shorten, proofread, and otherwise edit parts of their documents on demand” [5]. The Iterate-and-Vote workflow has been deployed in creating image descriptions, where workers are first asked to write descriptions of images (for example, with an end goal to assist those who are blind). Subsequent voting tasks are then used to converge on an optimal description [6]. The Map-Reduce workflow has been proposed for “partitioning work into tasks that can be done in parallel, mapping tasks to workers, and managing the dependencies between them” [7]. Sharing the same essence, tools like CrowdWeaver have been proposed for managing complex workflows, supporting data sharing between tasks, and providing monitoring tools and real-time task adjustment capability [8].
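To make the staged structure concrete, here is a minimal sketch of a Find-Fix-Verify loop. It is an illustration only, not Soylent's implementation: crowd workers are simulated as plain Python functions, and every name in it (`find_stage`, `fix_stage`, `verify_stage`, `min_agreement`, the toy workers) is a hypothetical choice made for this example.

```python
from collections import Counter
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the text


def find_stage(text: str, finders: List[Callable[[str], List[Span]]],
               min_agreement: int = 2) -> List[Span]:
    """Find: keep only spans flagged independently by enough workers."""
    votes = Counter(span for finder in finders for span in set(finder(text)))
    return sorted(span for span, count in votes.items() if count >= min_agreement)


def fix_stage(passage: str, fixers: List[Callable[[str], str]]) -> List[str]:
    """Fix: each worker proposes a rewrite of the flagged passage."""
    return [fixer(passage) for fixer in fixers]


def verify_stage(candidates: List[str],
                 verifiers: List[Callable[[List[str]], str]]) -> str:
    """Verify: a separate group of workers votes on the best rewrite."""
    votes = Counter(verifier(candidates) for verifier in verifiers)
    return votes.most_common(1)[0][0]


def find_fix_verify(text, finders, fixers, verifiers, min_agreement=2) -> str:
    spans = find_stage(text, finders, min_agreement)
    # Apply fixes right to left so earlier offsets stay valid.
    # Assumption: the agreed spans do not overlap.
    for start, end in reversed(spans):
        candidates = fix_stage(text[start:end], fixers)
        text = text[:start] + verify_stage(candidates, verifiers) + text[end:]
    return text


# Simulated workers (all hypothetical stand-ins for real crowd tasks).
def flag_repeat(text: str) -> List[Span]:
    i = text.find("on on")
    return [(i, i + 5)] if i >= 0 else []


finders = [flag_repeat, flag_repeat, lambda text: [(0, 3)]]  # third worker disagrees
fixers = [lambda p: "on", lambda p: "on top of", lambda p: "on"]
verifiers = [lambda cands: min(cands, key=len)] * 3  # each prefers the shortest fix

result = find_fix_verify("The cat sat on on the mat.", finders, fixers, verifiers)
print(result)
```

Separating the stages means no single worker both identifies a problem and approves its fix, which is the property the workflow relies on for quality.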

Boosting Crowdsourcing Workflows with LLMs

It is unlikely that the emergence of language models renders such workflows, frameworks, and tools obsolete. On the contrary, the crowdsourcing community is uniquely positioned to embrace the benefits that LLMs can bring by building on decades of research around effective workflows, human-in-the-loop approaches, and knowledge around building hybrid human-AI systems.

The human-centered perspective of developing technologies focuses on augmenting human experiences in everyday life and amplifying the abilities of people. If LLMs can indeed help crowd workers in completing tasks, they should be embraced and integrated in a fashion that empowers workers to complete tasks more accurately and quickly or in a fashion that improves their overall experience in one way or another.

Researchers in information retrieval (a community I have engaged with over the last decade) have recently considered what the proliferation of LLMs can mean for the role of human annotators in the context of relevance judgments for evaluation [9]. They proposed a spectrum of collaboration between humans and LLMs to produce relevance judgments (ranging from human judgments to fully automatic assessments, akin to the popular levels of automation). The authors explored the potential benefits of roping in LLMs in an assistive capacity for annotation tasks and weighed them against the risks of doing so. It is clear that LLMs can reduce annotation costs in creating evaluation collections. However, it is unclear whether such collections could be systematically different from those created by humans and how such artifacts would influence the evaluation of information retrieval systems and, thereby, the future design of such systems.
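One way to picture such a spectrum is as a single dial that decides how much of the judging is handed to the model. The sketch below is a hypothetical illustration, not the method proposed in [9]: it assumes the LLM returns a label together with a confidence score, and the names `judge_collection`, `threshold`, and the stub judges are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Judgment:
    doc_id: str
    relevant: bool
    source: str  # "llm" or "human"


def judge_collection(doc_ids: List[str],
                     llm_judge: Callable[[str], Tuple[bool, float]],
                     human_judge: Callable[[str], bool],
                     threshold: float) -> List[Judgment]:
    """Accept the LLM label when its confidence meets the threshold,
    otherwise route the document to a human annotator.

    threshold = 0.0 gives fully automatic assessment;
    threshold > 1.0 sends everything to humans.
    """
    judgments = []
    for doc_id in doc_ids:
        label, confidence = llm_judge(doc_id)
        if confidence >= threshold:
            judgments.append(Judgment(doc_id, label, "llm"))
        else:
            judgments.append(Judgment(doc_id, human_judge(doc_id), "human"))
    return judgments


# Stub judges with made-up outputs, for illustration only.
llm_outputs = {"d1": (True, 0.95), "d2": (False, 0.55), "d3": (True, 0.80)}
human_labels = {"d1": True, "d2": True, "d3": True}

hybrid = judge_collection(list(llm_outputs), llm_outputs.get, human_labels.get, 0.9)
automatic = judge_collection(list(llm_outputs), llm_outputs.get, human_labels.get, 0.0)
manual = judge_collection(list(llm_outputs), llm_outputs.get, human_labels.get, 1.1)

# How often the fully automatic collection agrees with the fully human one.
agreements = sum(a.relevant == m.relevant for a, m in zip(automatic, manual))
print([j.source for j in hybrid], agreements, "of", len(manual))
```

Comparing the two extremes, as the last lines do, is a crude version of the open question raised above: whether an LLM-built collection differs systematically from a human-built one.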

Beyond having LLMs take on individual writing or classification tasks within a workflow, researchers are also exploring how LLMs can assist crowd workers. Liu et al. combined the generative power of GPT-3 and the evaluative power of humans to create a new natural language inference dataset that produces more effective models when used as a training set [10]. In a similar vein, others introduced a ‘Generative Annotation Assistant’ to help in the production of dynamic adversarial data collections, significantly improving the rate of collection [11]. However, several open questions remain about how LLMs can improve the effectiveness of crowdsourcing workflows and how such workflows can be holistically evaluated.
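The general pattern of pairing machine generation with human evaluation can be sketched as a generate-then-validate loop. This is a generic illustration under stated assumptions, not the pipeline from [10] or [11]: the generator is a stand-in function rather than a real model call, annotators return a label or `None` to reject an example, and `build_dataset` and `min_votes` are hypothetical names.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def build_dataset(seeds: List[str],
                  generate: Callable[[str], List[str]],
                  annotators: List[Callable[[str], Optional[str]]],
                  min_votes: int = 2) -> List[Tuple[str, str]]:
    """Keep a generated example only if enough annotators agree on its label."""
    dataset = []
    for seed in seeds:
        for candidate in generate(seed):
            labels = [annotate(candidate) for annotate in annotators]
            votes = Counter(label for label in labels if label is not None)
            if not votes:
                continue  # every annotator rejected the example
            label, count = votes.most_common(1)[0]
            if count >= min_votes:
                dataset.append((candidate, label))
    return dataset


# Stand-in generator: one plausible and one garbled candidate per seed.
def fake_generator(seed: str) -> List[str]:
    return [f"{seed} An animal is present.", f"{seed} ???"]


def careful_annotator(candidate: str) -> Optional[str]:
    return None if "???" in candidate else "entailment"


def careless_annotator(candidate: str) -> Optional[str]:
    return "neutral"


dataset = build_dataset(
    ["A dog runs.", "A cat sleeps."],
    fake_generator,
    [careful_annotator, careful_annotator, careless_annotator],
)
print(dataset)
```

The agreement threshold is the quality control here: a garbled candidate that only one annotator is willing to label never reaches the dataset.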

Many Hurdles Along the Way?

Much like humans, LLMs can also be prone to bias and unfairness. On one hand, prior work has shown how human annotators fall prey to their own opinions while completing annotation tasks, leading to systematic biases creeping into the resulting data collection [12]. Others have proposed checklists for either combating or reporting potential cognitive biases that may have emerged during the annotation process [13]. On the other hand, recent work has revealed discriminatory stances and stereotypical biases present in LLMs [14, 15].

The human computation and crowdsourcing research community (HCOMP) has devised a number of effective methods, interfaces, measures, and tools to ensure the collection of high-quality data from crowd workers. It is only a matter of time before we collectively figure out how such quality-related guarantees can be laid out while integrating LLMs in decision-making pipelines.

On the surface, the integration of LLMs into crowdsourcing workflows can appear to be rather straightforward. As with most proposals for solutions related to complex systems, it is easier said than done. Crowdsourcing has many different stakeholders involved: the task requesters who are keen to gather large-scale annotations, the crowd workers willing to oblige in return for compensation, the platforms providing the infrastructure and serving as the marketplace for these transactions to take place, and indirect end-users of products or technologies that are developed or built in downstream efforts. The impact of including LLMs in workflows has the potential to affect each stakeholder in different ways.

If crowd workers can become more effective and efficient by leveraging LLMs in intelligent workflows, there is potential to get more work done without increasing costs. However, further work is required to better understand the risks and rewards of including LLMs in crowdsourcing workflows. Who would be responsible for designing, developing, and integrating LLMs into such workflows, considering the potential need for accountability?

Crowd workers have historically been left to their own devices to improve their productivity and the environments and conditions within which they operate. Shouldn’t it now be the collective responsibility of crowdsourcing platforms and task requesters to better understand how to equip workers with LLM-based solutions that can aid them in successful task completion and improve and augment their work experiences?

Artificial Artificial Artificial Intelligence and the Future That Can Be

A recent case study explored the extent to which crowdsourced data from “humans” in a text summarization task was genuinely generated by humans. The authors found evidence that over 30% of crowd workers in their study on Amazon Mechanical Turk have already begun to rely on LLMs [16]. Although the study reported these insights from 44 workers alone, and the numbers should be taken with a grain of salt, this does reflect the undeniable prospect of more crowd workers turning towards LLM-based solutions that can help them increase their productivity, maximize their earnings, and make better use of the time they spend in crowdsourcing marketplaces. This is where the notion of “artificial artificial artificial intelligence” arises: crowd workers potentially making use of AI (assistance from LLMs) to provide what is presumably “human” input on demand.

Figure: An illustration depicting the emergence of “artificial artificial artificial intelligence,” coined in [16], from AI (1) to AAI (2) and finally AAAI (3). Source: Image by author

Further consideration is needed regarding the transparency and explainability of LLMs compared to what can be elicited from humans. When crowd workers complete tasks such as annotation or others that require decision-making, task requesters can extract meaningful rationales through follow-up questions. Crowd workers have the wherewithal to provide such insights where needed. The same cannot currently be achieved with LLMs. Yes, there are methods for model explainability, but none have demonstrated a level of effectiveness on par with what can be achieved with humans at both ends of the line. This perception of LLMs as a “black box” can create barriers to adoption for task requesters and crowdsourcing platforms, while also impeding the appropriate reliance of crowd workers on such tools.

Humans and LLMs? There is an endless stream of possibilities with a sea of intriguing questions and only a handful of glimmering answers. Seizing the opportunity to integrate this technological advancement to improve crowd work is less like stirring a hornet’s nest and more like catching a gust of wind in our sails. Let us get busy, for a beautiful future awaits when we can shape it with humans taking center stage.

References

[1] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). IEEE.
[2] Kittur, A., Nickerson, J. V., Bernstein, M., Gerber, E., Shaw, A., Zimmerman, J., Lease, M., & Horton, J. (2013, February). The future of crowd work. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (pp. 1301–1318).
[3] Gray, M. L., & Suri, S. (2019). Ghost work: How to stop Silicon Valley from building a new global underclass. Eamon Dolan Books.
[4] Allen, G., He, G., & Gadiraju, U. (2023). Power-up! What Can Generative Models Do for Human Computation Workflows? In Proceedings of the Generative AI Workshop at ACM International Conference on Human Factors in Computing Systems (CHI 2023).
[5] Bernstein, M. S., Little, G., Miller, R. C., Hartmann, B., Ackerman, M. S., Karger, D. R., Crowell, D., & Panovich, K. (2010). Soylent: A word processor with a crowd inside. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology (pp. 313–322).
[6] Little, G., Chilton, L. B., Goldman, M., & Miller, R. C. (2009, June). TurKit: Tools for iterative tasks on Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (pp. 29–30).
[7] Kittur, A., Smus, B., Khamkar, S., & Kraut, R. E. (2011, October). CrowdForge: Crowdsourcing complex work. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (pp. 43–52).
[8] Kittur, A., Khamkar, S., André, P., & Kraut, R. (2012, February). CrowdWeaver: Visually managing complex crowd work. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (pp. 1033–1036).
[9] Faggioli, G., Dietz, L., Clarke, C., Demartini, G., Hagen, M., Hauff, C., Kando, N., Kanoulas, E., Potthast, M., Stein, B., & Wachsmuth, H. (2023). Perspectives on Large Language Models for Relevance Judgment. arXiv preprint arXiv:2304.09161.
[10] Liu, A., Swayamdipta, S., Smith, N. A., & Choi, Y. (2022). WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. In Findings of the Association for Computational Linguistics: EMNLP 2022.
[11] Bartolo, M., Thrush, T., Riedel, S., Stenetorp, P., Jia, R., & Kiela, D. (2021). Models in the loop: Aiding crowd workers with generative annotation assistants. arXiv preprint arXiv:2112.09062.
[12] Hube, C., Fetahu, B., & Gadiraju, U. (2019, May). Understanding and mitigating worker biases in the crowdsourced collection of subjective judgments. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–12).
[13] Draws, T., Rieger, A., Inel, O., Gadiraju, U., & Tintarev, N. (2021, October). A checklist to combat cognitive biases in crowdsourcing. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing (Vol. 9, pp. 48–59).
[14] Abid, A., Farooqi, M., & Zou, J. (2021, July). Persistent anti-Muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (pp. 298–306).
[15] Nadeem, M., Bethke, A., & Reddy, S. (2020). StereoSet: Measuring stereotypical bias in pre-trained language models. arXiv preprint arXiv:2004.09456.
[16] Veselovsky, V., Ribeiro, M. H., & West, R. (2023). Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks. arXiv preprint arXiv:2306.07899.
