Diversity first: how we craft creative writing prompts for fine-tuning GenAI
Large language models have wowed us all with their ability to invent a story, write a poem, or craft an email. But the impact of a model with creative writing capabilities reaches far beyond the wow factor. Businesses anticipate large gains in employee productivity and accelerated growth by leveraging generative AI for writing tasks.
The question is: how do language models develop creative writing skills?
The secret lies in the prompts used for model training. Because generative models are fine-tuned on prompt-response pairs, the quality of the training prompts directly shapes the quality of the model’s output.
In a recent post on technical specifications for building your own LLM, we shared the exciting news that Toloka is working on a joint project to build a large language model from scratch. In today’s post, we’ll share some first-hand details on how we collect the best prompts to help the model develop essential creative writing skills.
Aren’t LLMs already good at writing?
Large language models are innately good at generating texts. They are built to predict the next word and combine words into coherent narratives. But they need to learn how to follow the prompt and answer in a particular style — that’s where fine-tuning comes in.
To make LLMs better at creative writing, we need strong prompts that can elicit unique and interesting texts for fine-tuning the model. Here we run into two limitations: a lack of diverse datasets for training, and a lack of benchmarks for evaluating quality.
The dataset problem: synthetic vs organic data
There are many open datasets available for training, consisting of prompts paired with high-quality responses (these include ShareGPT, WizardLM, OpenAssistant, Dolly, and Alpaca). However, most of these datasets are synthetic: the prompts and answers are machine-generated, so they don’t always express ideas that real people are interested in. Some training datasets even use prompts generated by the models themselves.
The goal of our project is to help the language model develop authentic human-sounding writing skills, so we’re generating an organic set of training data written fully by humans. While ours is certainly not the first LLM to use human-written prompt sets, we embarked on the project with a mission to make the scope and diversity of the data unique. We take full advantage of Toloka’s global crowd to collect the most diverse prompts imaginable.
What makes a good prompt
A good prompt is organic, so we invite Tolokers from all over the world to contribute prompts that draw on their own personal experience. They are assigned prompt-writing tasks in different categories (essay, story, blog post, email, tweet, ad, movie script, and others) and are free to write a prompt on any topic that interests them, as long as they follow our tips for the category in the task. Tolokers craft and submit a prompt without writing an answer for it; the responses are written later by experts in a separate project, and we’ll share those details in an upcoming post.
Our instructions encourage Tolokers to be creative and come up with prompts that contain plenty of context and aren’t too trivial or generic (“Write a poem about love” is a bad prompt). The guidelines for the task include examples of interesting prompts and not-so-interesting prompts. For instance, “Write a short story about a dog” is not interesting enough to evoke a creative response, but “Write a story about a timid dog who traveled to outer space to save the planet” provides a springboard for creativity.
Here is an excerpt from the instructions and a submitted prompt.
Example of a submitted prompt
Creative writing prompts spotted in the wild
The best prompts include full context, like the topic, the basic plot and characters, or even an outline with the desired structure. To help Tolokers with writer’s block, we recommend making it personal — maybe there’s a topic that’s been on their mind lately, their child asked for a story, or they need to write something for work. Anything that is relevant to them or someone they know is a great subject for a prompt.
Here are some examples of writing prompts recently submitted by Tolokers.
Examples of good prompts
The evaluation problem: how to verify prompt quality
Now let’s consider the second roadblock: checking prompt quality. Benchmarks that evaluate the quality of models generally compare a text answer to the “golden” answer, word for word. But this isn’t a meaningful way to measure the quality of creative texts, which should be unique by definition.
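To see why, consider a toy word-overlap metric (our own illustration, not any specific benchmark’s implementation). Two answers can tell essentially the same story yet share almost no vocabulary, so the score punishes creativity rather than measuring quality:

```python
# Illustrative only: a naive word-overlap score, similar in spirit to the
# exact-match comparisons many benchmarks rely on.
def word_overlap(candidate: str, reference: str) -> float:
    """Fraction of unique reference words that also appear in the candidate."""
    cand_words = set(candidate.lower().split())
    ref_words = set(reference.lower().split())
    return len(cand_words & ref_words) / len(ref_words)

golden = "The timid dog boarded the rocket, trembling, and saved the planet."
answer = "Shaking with fear, a shy little terrier climbed into the shuttle and rescued Earth."

# Both responses tell the same story, but the overlap score is near zero.
print(round(word_overlap(answer, golden), 2))  # 0.22
```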
The only reasonable way to judge creativity is to collect the opinions of real people. Unfortunately, it’s prohibitively expensive and time-consuming to read through thousands of texts. We’ve implemented a better solution: an automated human verification pipeline using the Toloka crowd.
Tolokers who perform verification tasks are given a prompt written by another person, along with the guidelines. They decide if the prompt follows the guidelines or not — a simple question of “good” or “bad”. The key here is to make the guidelines as clear and specific as possible, with plenty of examples of what makes a prompt good or bad.
Example of a verification task
Since this works like a binary text classification task, we can use our proven quality control methods in the pipeline to make sure verification is accurate. One of our core methods is overlap, meaning we assign multiple people to verify the same prompt and then aggregate their answers to get a reliable final verdict. We use the prompts that “pass” for model training and discard the ones that are rejected. With our large crowd of Tolokers, the process is fast and efficient.
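To illustrate the idea, here’s a minimal majority-vote sketch in Python. The prompt IDs, labels, and overlap of three are assumptions for this example; a production pipeline would typically layer skill-weighted aggregation methods (such as Dawid-Skene) on top of the same principle:

```python
from collections import Counter

# Each prompt is verified by several Tolokers (overlap), and the majority
# label decides the final verdict. All data below is hypothetical.
verdicts = {
    "prompt_001": ["good", "good", "bad"],
    "prompt_002": ["bad", "bad", "good"],
    "prompt_003": ["good", "good", "good"],
}

def aggregate(labels: list[str]) -> str:
    """Return the majority label among the verifiers' answers."""
    return Counter(labels).most_common(1)[0][0]

# Prompts that "pass" go into the training set; the rest are discarded.
accepted = [p for p, labels in verdicts.items() if aggregate(labels) == "good"]
print(accepted)  # ['prompt_001', 'prompt_003']
```

Redundant judgments plus aggregation turn many individual opinions into one reliable verdict, which is what lets the pipeline scale.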
Why we prioritize prompt diversity
We’ve collected over 6,000 prompts for creative writing. But what’s most impressive is the diversity of this dataset. Thousands of Tolokers from 69 countries, with a multitude of backgrounds and cultural contexts, have contributed ideas that are meaningful to them.
A model that is trained on diverse data can emulate a broader range of human experiences and expectations, making it less biased and more helpful to real people in future applications.
Looking to collect prompts for your own LLM? Tap into the Toloka crowd for a global perspective. Reach out to discuss the ideal pipeline for your project.
Article written by:
Sergey Koshelev
Updated:
Oct 24, 2023