AI in Love: Crafting Love Poems with Large Language Models

by Magdalena Konkiewicz


Large Language Models (LLMs) have caused a substantial shift within the artificial intelligence sector, integrating into consumers' daily lives by assisting with tasks like text classification, summarization, and question answering. One field where they particularly excel is creative writing: they can craft emails, marketing slogans, or even full essays that are hard to distinguish from human-written content. But can these LLMs also woo us with poetry?

With Valentine's Day approaching, Toloka asked three widely accessible LLM-powered chatbots — ChatGPT, Gemini (formerly Bard), and Copilot — to craft love poems for the occasion. This article examines the quality of the poems, the potential for personalization, and whether results could be consistently replicated by the average user.


LLM Comparison: Fundamentals of Poem Crafting

In the initial phase of our investigation, Toloka tasked the chatbots with a simple challenge: write a short poem dedicated to Valentine’s Day, with no supplementary details provided. The exact prompt is shown below, alongside the chatbots’ answers.

[Image: the prompt and the three chatbots’ poems]

The generated poems, while distinct in vocabulary, shared a structural symmetry, each comprising three four-line stanzas. The poems from ChatGPT and Gemini follow a consistent AABB rhyme scheme within each stanza: the first two lines rhyme with each other, as do the last two. This stylistic choice was most likely picked up from the training data.
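
To make the AABB pattern concrete, here is a minimal Python sketch of a rhyme-scheme check built on the third-party pronouncing library (a wrapper around the CMU Pronouncing Dictionary). It is only a rough heuristic, and the sample stanza is an illustrative one written for this example, not one of the poems generated above (those appear in the image).

```python
import pronouncing  # pip install pronouncing; wraps the CMU Pronouncing Dictionary


def last_word(line: str) -> str:
    """Return the final word of a line, stripped of punctuation and lowercased."""
    return line.split()[-1].strip(".,;:!?\"'").lower()


def lines_rhyme(a: str, b: str) -> bool:
    """Rough heuristic: the lines rhyme if b's last word appears in the rhyme list for a's last word."""
    return last_word(b) in pronouncing.rhymes(last_word(a))


def is_aabb(stanza: list[str]) -> bool:
    """A four-line stanza follows AABB if lines 1-2 rhyme and lines 3-4 rhyme."""
    return lines_rhyme(stanza[0], stanza[1]) and lines_rhyme(stanza[2], stanza[3])


# Illustrative stanza written for this example, not one of the generated poems.
stanza = [
    "Roses bloom in shades of red,",
    "Sweet words of love are softly said,",
    "Two hearts entwined beneath the sun,",
    "Forever bound, two souls as one.",
]
print(is_aabb(stanza))  # Expected: True (words missing from the dictionary would make the check fail)
```

A heuristic like this breaks down on slant rhymes or out-of-dictionary words, which is one reason we relied on human literature experts rather than automated checks for the evaluation that follows.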

To dive deeper into the LLMs’ poetry-writing skills, we ran these results by a group of literature experts using the Toloka Deep Evaluation Platform. They found that all of the poems conveyed emotion well and were grammatically and structurally sound. On the other hand, they criticized the clichéd vocabulary and the lack of narrative continuity. All of the experts preferred Copilot's poem, commenting on its slightly more elaborate vocabulary and the lovely picture it conjures of lovers walking in a rose garden.

All of the poems, however, lack the kind of personalization one would expect from a handwritten poem, and we address this aspect in the next phase of the experiment.

LLM Comparison: Crafting Customized Poetic Narratives

To add a layer of personalization, the second iteration of our study introduced additional elements in the prompt provided to the model, including the lover’s name, the setting of the couple's initial encounter, and a brief account of their relationship’s evolution. Let’s examine how well each model wove these details into its poem.
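
As a rough illustration of how such a personalized prompt might be assembled and sent programmatically, here is a minimal Python sketch against the OpenAI chat API. The prompt wording, the meeting place, and the relationship details are hypothetical stand-ins (the exact text we used is shown in the image below), and the model name is an assumption.

```python
from openai import OpenAI  # pip install openai; the client reads OPENAI_API_KEY from the environment

client = OpenAI()

# Hypothetical personalization details; only the name and the two-year timeline come from the article.
details = {
    "name": "Monica",
    "first_meeting": "a small coffee shop downtown",  # illustrative stand-in
    "story": "we have been together for two years and love hiking on weekends",  # illustrative stand-in
}

# Stand-in prompt wording, not the exact text used in the experiment.
prompt = (
    "Write a short Valentine's Day love poem for my girlfriend "
    f"{details['name']}. We first met at {details['first_meeting']}. "
    f"A little about our relationship: {details['story']}."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; any chat-capable model works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)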

[Image: the personalized prompt and the three chatbots’ poems]

All of the models maintained the four-line stanza format after running the prompt but increased the number of stanzas from three to four, most likely to incorporate all of the required details. The poems vividly brought Monica to life, capturing the genesis of the couple's romance and their journey, showcasing the models’ impressive capacity for personalization.

Gemini and ChatGPT kept the same AABB rhyme scheme, enhancing the lyrical quality of their offerings, while Copilot Chat gave its poem a title, “Monica's Wave”, and added a simple one-liner at the end for the occasion.

So what did our literature experts think of the GenAI-personalized poems? They again strongly preferred the Copilot poem for its superior vocabulary and better flow. They also felt that the free verse (no rhyme scheme) used in this poem comes across as more sincere and personal. On the other hand, they noticed that the poem's perspective was off: it was not written as if addressed by the author to his girlfriend. Despite this small glitch, the experts agreed that this was the best AI-generated poem.

Once again, Gemini’s and ChatGPT's creations were criticized for being “lazy” and overusing simplistic phrases and vocabulary. Gemini’s poem earned extra points for mentioning the two-year time period and building the story around it. Copilot also captured this detail, leaving ChatGPT as the only model that did not pick up this small but important nuance.

Comparing LLM- and Human-Crafted Poetry

To add an extra layer of analysis to this research, Nik Barkley, VP of brand marketing and experience design at PAN Communications, crafted his own Valentine’s Day poems using the same prompts provided to the LLMs. Nik is creative at heart and dabbles in many different mediums, from watercolor art to poetry.

We then ran a brief poll using Dynata, surveying 1,000 US consumers ages 18 and up, to find out whether average readers could tell whether each poem was written by a human or by generative AI.

[Image: poll results]

For Prompt 1, 58% of respondents correctly identified the human-written poem; however, most respondents also believed the three AI-generated poems to be written by humans: ChatGPT (59%), Bard/Gemini (56%), and Copilot (61%). Of the four poems, 31% of respondents preferred the human-written poem, narrowly beating out the ChatGPT-generated poem (30%).

For Prompt 2, which asked the models to personalize the poem in more detail, 55% of respondents correctly identified the more personalized, human-written poem. Most respondents still believed the three AI-generated poems to be written by humans: ChatGPT (51%), Bard/Gemini (55%), and Copilot (52%). Respondents misattributed the AI poems more often for Prompt 1, meaning the less personalized writing came across as more “human” than the writing produced from the more detailed prompt. Of the four poems, 31% of respondents preferred the human-written poem, beating out the Copilot-generated poem (28%).

What this shows is that LLMs come very close to imitating human creative writing, especially in generalized contexts.

LLMs' role in the future of creative writing

For those seeking to enchant their loved ones with a unique and personalized poetic gesture, LLMs present a promising avenue. Even though the experts preferred Copilot's poems for their more original content, we can all agree that poetry is often a matter of taste and opinions may differ. All of the models produced acceptable results, and we believe the experiment showcases the impressive capabilities of current LLMs. This opens up conversations about the future of creative writing in the age of AI and shows that these tools can augment human creativity by helping us create personalized poems, texts, and stories.

So which poem was your favorite?

The poem examples created for this article were written by general-purpose models that are not specifically trained to produce poetry. Evaluating the output of these models is complex and nuanced due to the subjectivity of what makes a good poem. With Toloka Deep Evaluation, we can evaluate metrics such as grammar, construction, novelty, flow, and usage of metaphors or other linguistic techniques to judge models' performance. Deep insights from expert evaluators can provide valuable feedback to improve models across all types of domains.

Article written by:
Magdalena Konkiewicz
