Retrieval Augmented Generation (RAG): enhancing the pertinence of the generated content

by Toloka Team

No matter how smart large language models are, they will always have knowledge gaps. One way to solve this is to retrain the model, but that is not necessarily the best solution, since additional training takes time as well as significant computational and financial resources. This is where Retrieval Augmented Generation, RAG for short, comes to the rescue. Despite its unflattering acronym, it is a useful tool that allows foundation models such as large language models to respond to questions that are not covered by their training data.


What is RAG?

Retrieval Augmented Generation (RAG) is a method of working with large language models in which a user's query is augmented with additional information from external sources before being fed to the language model. In other words, extra data is supplied alongside the query so that the model can give the user a more complete and accurate answer.

Unlike traditional generative models, which rely solely on what they learned during pre-training, RAG goes beyond mere generation: it dynamically retrieves information from knowledge bases. With the RAG approach, the system first finds the knowledge base entries relevant to the user's question, then passes both the query and the related portion of the knowledge base content to the LLM so it can produce the correct answer.
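As a minimal sketch of this idea (with a made-up question, passage, and prompt template rather than any particular framework's API), the augmented prompt might be assembled like this:

```python
def build_augmented_prompt(query: str, passages: list[str]) -> str:
    """Combine retrieved knowledge-base passages with the user's query into one prompt."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical example: in a real system the passage would come from the retrieval step.
prompt = build_augmented_prompt(
    "How long do customers have to request a refund?",
    ["Refunds are accepted within 30 days of purchase."],
)
print(prompt)  # this augmented prompt, not the bare query, is what the LLM receives
```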

How Retrieval-Augmented Generation Works

Retrieval-augmented generation is a natural language processing (NLP) approach that combines components of retrieval and generation models. The goal of RAG is to produce a more informed and relevant output than a standalone text generation model could achieve. RAG rests on three underlying components: pre-processing, information retrieval, and response generation.

Raw Data Pre-processing for Knowledge Base Creation

Businesses are constantly updating data and adding new records to their databases. It is impossible to fit all of this dynamic data and all of these documents into a prompt: even recent language models such as GPT-4 can fit only about 50 pages of text into a prompt. For any company, let alone a large corporation, that is a very small amount of data. That is why it is necessary to retrieve the appropriate information from the knowledge base and add it to the query before prompting a language model.

Raw data can come in different formats, ranging from plain text and database entries to various kinds of files. All of the data that will eventually be fed to the LLM is organized into individual pieces of text, or chunks, whose size can vary from a few lines to several paragraphs.

A specialized embedding model then transforms these chunks into numerical representations, which are stored in a vector database; this is how knowledge bases, or knowledge libraries, emerge. Such a knowledge library can then be made available to generative AI models. The vector database stores a numerical representation of each data chunk, which facilitates the search and retrieval of specific information.
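Under simplified assumptions, this pre-processing step might look like the sketch below: documents are split by word count, a toy hashing function stands in for a real embedding model, and a plain Python list stands in for a vector database.

```python
import numpy as np

def chunk_text(text: str, max_words: int = 100) -> list[str]:
    """Split a document into pieces of up to max_words words each."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing embedding; a real system would call a trained embedding model here."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Stand-in for a vector database: (embedding, text chunk, source document) triples.
vector_store: list[tuple[np.ndarray, str, str]] = []

def index_document(doc_id: str, text: str) -> None:
    """Chunk a document, embed each chunk, and add it to the knowledge base."""
    for chunk in chunk_text(text):
        vector_store.append((embed(chunk), chunk, doc_id))

index_document("refund_policy.txt", "Refunds are accepted within 30 days of purchase.")
```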

Retrieval of Additional Specific Context

The retrieval-augmented generation process employs semantic search to extract the most relevant and accurate pieces of data. The user's prompt is passed to a separate model, the embedding model, which converts it into numeric form.

This embedding, or vector (the user's prompt converted into numeric form), is then compared against the vectors in the knowledge base to find information relevant to the input. The chunks located this way are integrated into the input context for the large language model, serving as additional, query-specific context.
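Continuing the pre-processing sketch above (and reusing its toy embed function and in-memory vector_store), retrieval can be reduced to embedding the query and ranking the stored chunks by similarity:

```python
import numpy as np

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Embed the query and return the most similar knowledge-base chunks."""
    query_vec = embed(query)  # the same embedding model used when indexing
    scored = [(float(np.dot(query_vec, vec)), chunk) for vec, chunk, _ in vector_store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

context_chunks = retrieve("How long do customers have to request a refund?")
```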

Generation of Response

After context retrieval is done, the retrieved text is combined with the user's query into an augmented prompt, and the LLM generates a response from it. The generated output is then returned to the user. This response is expected to provide a pertinent and highly informative answer to the user's query, drawing on both the augmented prompt and the model's original training data.
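Tying the earlier sketches together, response generation then amounts to retrieving context, building the augmented prompt, and calling the model; call_llm below is just a placeholder for whatever hosted or local LLM is actually used:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a hosted API or a local model)."""
    return "Customers can request a refund within 30 days of purchase."

def answer_with_rag(query: str) -> str:
    """Retrieve relevant chunks, augment the prompt, and generate the final response."""
    passages = retrieve(query)                        # retrieval step from the sketch above
    prompt = build_augmented_prompt(query, passages)  # augmentation step from the sketch above
    return call_llm(prompt)                           # generation step

print(answer_with_rag("How long do customers have to request a refund?"))
```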

To keep the external knowledge base up to date, and to get accurate and correct answers from the language model, it is necessary to employ various data processing approaches, such as automated real-time processing or periodic batch processing. These allow source documents to be added and modified and their content to be converted into numerical representations for storage in the vector database.
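A periodic batch update can be as simple as re-indexing the documents that have changed since the last run. The sketch below reuses the index_document helper and vector_store from the pre-processing sketch; the dictionary of changed documents stands in for a real data pipeline.

```python
def refresh_knowledge_base(changed_docs: dict[str, str]) -> None:
    """Replace stale chunks for updated documents and index the new content."""
    global vector_store
    # Drop existing chunks that came from the changed documents.
    vector_store = [entry for entry in vector_store if entry[2] not in changed_docs]
    # Re-chunk, re-embed, and store the new versions.
    for doc_id, text in changed_docs.items():
        index_document(doc_id, text)

# Hypothetical batch job, e.g. run nightly:
refresh_knowledge_base({"refund_policy.txt": "Refunds are accepted within 45 days of purchase."})
```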

Fine-tuning vs. RAG

Even if an organization has fine-tuned its LLM to handle domain-specific knowledge, it may run into issues, since it is impossible to re-train the model every time new information or documents appear. And as already mentioned, new data never stops appearing, especially in medium-sized and large companies. The model simply won't give an answer when asked about new data that didn't exist before.

In this case, using RAG is a good idea. However, this does not exclude the use of fine-tuning, as the two processes are complementary rather than interchangeable. After such additional training the model itself changes, i.e., it gains new skills or changes its behavior, while RAG adds relevant external information that the LLM draws on when prompted.

A fine-tuned model is a pre-trained language model that has been additionally trained on a specific task or domain, adapting it to perform that task. RAG, in contrast, involves a retrieval mechanism that pulls applicable information from a knowledge base and gives the model access to external data previously unknown to it, so that it can generate contextually relevant and accurate responses.

Fine-tuning adapts model parameters to perform well on a specific task, while RAG adapts the generation process by incorporating external information through retrieval. Both approaches bring their own advantages in improving the performance of generative AI.

Fine-tuning brings global changes to the model: an LLM can become an assistant in fields such as medicine or law, stop making certain kinds of mistakes when generating answers, or start responding to prompts in a particular tone. RAG, on the other hand, is more responsive to changes and updates in the underlying information.

Fine-tuning can be compared to becoming a doctor, which requires a student to earn a medical degree after several years at medical school. RAG, in turn, is a way for a practicing doctor to gain new knowledge from third-party research by carefully examining its textual description. The doctor would have a hard time understanding that description without having gone to medical school.

Can RAG Be Used Without Fine-Tuning?

Pre-trained models are quite powerful on their own. They have already learned language patterns and general knowledge during their initial training, so such models can be applied to various tasks and used for retrieval-augmented generation without further adjustment.

If the basic pre-trained model possesses the capabilities you need but lacks the desired data, RAG alone will suffice. For many general use cases, you can achieve meaningful results using the base pre-trained model.

However, fine-tuning may still be necessary to achieve optimal performance for some applications, especially in complex and demanding domains where errors are unacceptable. If the pre-trained foundation model cannot fulfill the desired functions, such as generating code, fine-tuning is a must.

Benefits of RAG

Besides being a valuable complement to fine-tuning, Retrieval Augmented Generation addresses several challenges of adapting foundation models to a constantly changing data landscape. It offers several benefits in natural language processing tasks:

Customization for Specific Domains

Retrieval augmented generation can be adapted to the specific demands of particular industries, tasks, or applications, enhancing its relevance and performance in specific use cases. Such customization allows for the integration of domain-specific information. Moreover, domain experts may contribute to the knowledge library and steer the customization process, aligning the model closely with the subtleties and nuances of a particular domain.

Context Enrichment

RAG allows the model to incorporate external context from the data library, thus increasing the richness and contextual relevance of the generated responses.

This is especially beneficial when the base model's training data, or the content it received during additional tuning, does not cover a question, and it therefore fails to answer. Supplying the model with a wider set of up-to-date information lets it produce accurate and contextually relevant responses.

Improved Factual Accuracy

When using conversational AI applications, users usually ask specific questions. By retrieving information from a knowledge base, RAG models have the potential to generate more factually correct responses. In tasks where the input involves factually complex queries or requires detailed information, RAG models can excel by retrieving and incorporating accurate details from the knowledge libraries.

The factual accuracy of models using RAG relies heavily on the reliability and quality of the knowledge library. RAG leverages external knowledge bases that are validated and considered reliable sources of information. If the knowledge library is well curated and regularly updated, it enhances the model's ability to provide accurate and trustworthy facts.

Despite the focus on accurate information, it's important to be aware of potential biases present in the knowledge base. If the external source contains biases, the model may inherit or propagate them.

Decreased Risk of Hallucinations

Traditional generative AI models sometimes deliver fabricated, often false responses when confronted with ambiguous prompts or insufficient training data. Backed by external knowledge, retrieval augmented generation lets the generation process be guided by reliable information, thereby reducing the risk of generating content that has no foundation in reality.

Dynamic Context Updating

A retrieval augmented generation system can be updated dynamically: the knowledge available to it can be adjusted without retraining the entire model. This ensures that the model can adapt to new data, retrieve relevant information, and provide responses based on the latest updates. This advantage is particularly valuable when a company's information landscape is dynamic, with data constantly being updated or amended and new documents and database records being added.

Trust, Transparency and Confidence

The RAG model outputs may contain references to the data sources. In addition to increasing transparency, this gives users the ability to verify the information on their own in case they are looking for additional clarity or more detail. This approach to generating textual information helps to increase the credibility of the content generated by AI and boost confidence in your generative AI solution.
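One simple way to surface such references, continuing the earlier sketches (where every stored chunk keeps the id of its source document), is to return the sources together with the retrieved text and append them to the answer:

```python
def retrieve_with_sources(query: str, top_k: int = 3) -> list[tuple[str, str]]:
    """Return (chunk, source document) pairs for the best-matching knowledge-base entries."""
    query_vec = embed(query)
    scored = [(float(np.dot(query_vec, vec)), chunk, doc_id) for vec, chunk, doc_id in vector_store]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(chunk, doc_id) for _, chunk, doc_id in scored[:top_k]]

def answer_with_citations(query: str) -> str:
    """Generate an answer and list the documents the retrieved context came from."""
    results = retrieve_with_sources(query)
    passages = [chunk for chunk, _ in results]
    sources = sorted({doc_id for _, doc_id in results})
    answer = call_llm(build_augmented_prompt(query, passages))
    return f"{answer}\n\nSources: {', '.join(sources)}"

print(answer_with_citations("How long do customers have to request a refund?"))
```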

RAG Is a Cost-Efficient Approach for Improving the Performance of Generative AI Apps

When first introduced to RAG for foundation models, one might think that it has a lot in common with fine-tuning and that the two are interchangeable approaches. However, this is not entirely true. In terms of implementation, retrieval augmented generation is usually less costly, yet it brings equally substantial results in the context of LLM customization.

Instead of relying on extensive training with massive datasets, retrieval augmented generation can tap into existing data sources, making more efficient use of available information. Reusing knowledge in this way can be more cost-effective than training models on specific datasets. By focusing on retrieving relevant information rather than on extensive training with task-specific data, RAG workflows may require fewer computational resources than models trained on massive datasets.

Goals of Retrieval Augmented Generation

The main goal of retrieval augmented generation is to enhance the factual accuracy of the generated content. RAG focuses on producing contextualized responses, taking advantage of both the language generation capabilities built into the pre-trained or fine-tuned model and the knowledge available in external knowledge libraries.

The retrieval-augmented generation framework aims to address some limitations of generative models, such as the potential for generating incorrect or nonsensical information, as well as the possibility of not generating any response at all. By incorporating a retrieval step into the generation process, the model has access to a broader set of information, improving the quality and relevance of the generated content.
