Toloka Team

Nov 29, 2024

Essential ML Guide

How to build an LLM: what you need to know first

Large Language Models (LLMs) have become a cornerstone of modern artificial intelligence, transforming whole industries through natural language processing (NLP). From generating creative content to extracting valuable insights from data, LLMs are reshaping the way we interact with technology.

Developing a custom Large Language Model has traditionally been complex, requiring significant computational resources and expertise mostly available to tech giants. However, the rise of open-source tools and scalable cloud infrastructure has made this process more accessible. Today, organizations and teams of all sizes are designing and training specialized language models to meet their unique needs.

The global Large Language Model (LLM) market is expected to reach around USD 82.1 billion by 2033, up from USD 4.5 billion in 2023. Source: Market.US

Rather than a step-by-step tutorial, this overview highlights the essential milestones, challenges, and decisions involved in the process. This knowledge should help you estimate and plan the development of a robust LLM for your specific goals.

Use case definition: why do you need an LLM?

Before planning your own Large Language Model (LLM) training and deployment, your team must clarify its purpose. A well-defined use case acts as a blueprint, guiding every decision in the development process.

The most common use cases for custom Large Language Models include:

  • Conversational agents: building customer support chatbots or virtual assistants.

  • Content creation: generating articles, marketing materials, or social media posts.

  • Text summarization: extracting key points from documents and reports.

  • Sentiment analysis: assessing the emotional tone in reviews or social media interactions.

  • Translation: delivering real-time translation between different languages.

Why defining a use case matters

Understanding what your model is intended to do is the first and most important step in determining whether you truly need a custom Large Language Model. Beyond this initial assessment, defining a clear use case is vital, as it influences model size and complexity, data requirements, and resources needed for the model’s deployment and maintenance.

Strategies to address the challenges in LLM projects also heavily depend on the specific use case, often inspiring innovative solutions tailored to unique constraints. For example, in 2024, Dr. Souvika Sarkar and her colleagues proposed a hierarchical, distributed model architecture that enables the efficient deployment of Large Language Models on less powerful devices like laptops.

A use case example — leveraging hierarchical language model architecture. Source: LLMs as On-demand Customizable Service

Custom LLM vs. fine-tuning existing models

Defining your use case also helps determine whether building a custom LLM is necessary or whether fine-tuning an existing model would suffice. A custom model may be the better choice if you need:

  • Domain-specific expertise: training your LLM with specialized data ensures it aligns closely with your organization’s industry or workflow.

  • Data privacy and security: incorporating sensitive or proprietary information into your model allows you to avoid potential risks associated with external or third-party systems.

  • Control and flexibility: owning and managing your LLM enables continuous updates and optimizations as your requirements evolve.

By clearly defining your use case at the start, you ensure a focused development process and establish a strong foundation for building a language model that effectively addresses your organization’s unique needs.

Hardware requirements for LLMs

Building and training your own Large Language Model still requires substantial computational resources. Depending on your model's complexity and goals, you must decide between deploying on-premise infrastructure or applying cloud solutions. Each approach has its advantages, which vary according to budget, scalability, and the expertise available within your team.

On-Premise Solutions

1. GPUs (Graphics Processing Units)

  • Overview: GPUs excel in parallel processing, essential for the matrix-heavy computations required in deep learning tasks. High-performance GPUs with at least 16GB of VRAM are considered the industry standard for training large language models.

  • Who chooses this? Big tech companies or AI-focused startups that can afford on-premise hardware investments often opt for high-performance GPUs to maintain greater control over their infrastructure.

2. TPUs (Tensor Processing Units)

  • Overview: TPUs are Google's proprietary hardware explicitly designed to accelerate machine learning tasks, particularly neural network computations. 

  • Who chooses this? Organizations already invested in Google Cloud infrastructure often choose TPUs to maximize the efficiency of their workflows. 

Comparison between GPUs and TPUs by a few key parameters. Source: GPU vs TPU for LLM Training: A Comprehensive Analysis

3. Storage solutions

  • Overview: Fast, high-capacity storage is essential for handling large datasets and model checkpoints during training. Solid State Drives (SSDs) are commonly used for their speed and reliability. 

  • Who chooses this? Businesses, usually mid-to-large enterprises, that need to manage large datasets locally, including confidential or sensitive data.

4. Other hardware solutions for LLM acceleration

  • Overview: Beyond GPUs and TPUs, hardware accelerators like FPGAs (Field Programmable Gate Arrays) and in-memory computing architectures are gaining traction for LLM training. FPGAs provide customizable hardware acceleration for specific workloads, while in-memory architectures minimize data transfer bottlenecks, significantly boosting energy efficiency.

  • Who chooses this and why? Organizations needing flexible and energy-efficient hardware for diverse applications may opt for FPGAs. Companies facing high memory bandwidth requirements might prefer in-memory accelerators like the TransPIM system. These solutions are particularly effective for large-scale parallel processing tasks.

Cloud Solutions

1. Cloud Providers

Overview: Cloud platforms like AWS, Google Cloud, and Microsoft Azure offer access to virtual machines equipped with powerful GPUs and TPUs. These platforms are particularly attractive for projects prioritizing flexibility and for organizations lacking the resources to maintain on-premise infrastructure.

Who chooses this and why?

  • Startups and SMEs: Small to mid-sized companies often rely on cloud computing due to the high upfront hardware costs and the challenges of maintaining a large server farm. Cloud services allow them to access cutting-edge GPUs and TPUs when needed, scaling resources up or down.

  • Companies that focus on scalability: Organizations with unpredictable workloads or limited in-house expertise in infrastructure management often favor cloud solutions. The ability to quickly scale resources ensures that projects can move forward without delays, even if computational demands increase unexpectedly.

2. Managed Services

Overview: Cloud providers offer managed machine learning services such as AWS SageMaker or Google AI Platform, which streamline the process of training, deploying, and monitoring machine learning models.

Who chooses this and why?

  • Organizations lacking in-house ML expertise: Teams that lack the specialized knowledge to manage infrastructure themselves might choose a managed service. These platforms simplify the workflow by handling everything from model training to deployment, making it easier to focus on development and innovation.

  • Teams prioritizing time efficiency: Managed services allow users to get started quickly, eliminating the need to set up and maintain custom infrastructure.

Choosing the training dataset

The quality of your training dataset plays a pivotal role in determining the success of your Large Language Model (LLM). While many pre-existing datasets are available, building or augmenting your own dataset often becomes essential when developing a model tailored to specific tasks or domains.

Understanding different types of datasets

To ensure your LLM is effective across its lifecycle, it’s essential to recognize the various types of datasets used for specific stages of development:

  1. Pre-training Corpora

    • Purpose: Provides a broad understanding of language for foundational training.

    • Characteristics: These datasets are vast and diverse, typically spanning billions of text samples across multiple domains.

    • Examples:

      • Wikipedia and Common Crawl for general-purpose text.

      • The Pile for a mix of literature, coding, and dialogue data.

      • PubMed for biomedical and clinical applications.

  2. Instruction fine-tuning datasets

    • Purpose: Fine-tune the model to perform specialized tasks and follow structured prompts.

    • Examples:

      • FLAN datasets for high-quality task-specific fine-tuning.

      • Proprietary datasets tailored for tasks like customer service or legal document analysis.

  3. Preference datasets

    • Purpose: Used in reinforcement learning from human feedback (RLHF) to align model outputs with user preferences.

    • Examples:

      • Feedback datasets curated from ranking model responses.

      • OpenAI’s preference datasets or Anthropic’s RLHF datasets for ethical and helpful outputs.

  4. Evaluation datasets

    • Purpose: Benchmarking model performance across accuracy, relevance, and fairness tasks.

    • Examples:

      • GLUE and SuperGLUE for natural language understanding.

      • TruthfulQA to measure factual accuracy.

These dataset types ensure your language model is appropriately trained, fine-tuned, and evaluated to meet specific goals.

Using pre-existing datasets

Many developers start with publicly available datasets to save time and resources. Well-known datasets like Common Crawl, Waymo Open Dataset (for autonomous vehicles), and ArXiv (academic papers) provide solid foundations for general or domain-specific applications.

However, these datasets often require augmentation or refinement to align with unique tasks. Many developers augment them with data reused from previous projects, ensuring consistency while tailoring the corpus to new objectives.

A timeline of some representative LLM datasets. Orange represents pre-training corpora, yellow—instruction fine-tuning datasets, green—preference datasets, and pink—evaluation datasets. Source: Datasets for Large Language Models

For example:

  • A healthcare AI system could combine PubMed with anonymized patient records to enhance relevance for clinical applications.

  • An autonomous vehicle project might start with the Waymo Open Dataset but add proprietary sensory data to handle specific weather or road conditions.

Preparing your own dataset

Custom datasets are essential for aligning your LLM with specific goals. When preparing your dataset, consider the following:

  1. Data quality

    • Remove duplicates, irrelevant entries, and noisy data to ensure clean inputs (see the cleaning sketch after this list).

    • Incorporate diverse and representative data to avoid biases.

Example: A multilingual customer support chatbot must include datasets representing the nuances of each supported language.

  2. Transparency and licensing
    Licensing and provenance issues can affect both legal compliance and model performance. A 2024 study led by MIT researchers found that over 70% of datasets lack proper licensing information, raising significant risks.

Example: Tools like the Data Provenance Explorer help practitioners identify datasets that align with their intended use while ensuring compliance.

  3. Dataset size

    • For foundational models: billions of samples are needed to understand language comprehensively.

    • For specialized models: smaller, high-quality datasets specific to the domain are often sufficient.

Example: A financial LLM might rely on millions of financial reports and news articles instead of general web text.
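
For illustration, the sketch below applies the data-quality step from the list above: dropping exact duplicates and near-empty entries from a toy corpus with pandas. The column name and sample rows are illustrative assumptions, not part of any real pipeline.

```python
# A minimal data-cleaning sketch: remove exact duplicates and empty entries.
# The "text" column and sample rows are illustrative placeholders.
import pandas as pd

corpus = pd.DataFrame({"text": [
    "Quarterly revenue rose 4%.",
    "Quarterly revenue rose 4%.",     # exact duplicate to be dropped
    "   ",                            # near-empty entry to be dropped
    "The board approved the merger.",
]})

corpus["text"] = corpus["text"].str.strip()
cleaned = corpus[corpus["text"].str.len() > 0].drop_duplicates(subset="text")
print(cleaned)
```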

Using diverse datasets

Whether you start with publicly available datasets, fine-tune with proprietary data, or build a custom dataset from scratch, aligning your dataset with your LLM’s intended purpose is critical. By leveraging diverse dataset types, such as pre-training corpora, preference datasets, and evaluation benchmarks, you can ensure your model performs effectively at every stage of its lifecycle.

Incorporating transparency and licensing tools and blending domain-specific data with general-purpose collections will allow you to create a robust and compliant LLM tailored to your organization’s unique needs.

Data preprocessing

Once you've selected your training dataset, the next critical step is data preprocessing. This process ensures your data is clean, standardized, and ready for the model to learn effectively. 

Key steps in data preprocessing

Tokenization

Tokenization means breaking text into smaller units, called tokens, that the model can understand. Tokens can be:

  • Words: Splitting by spaces or delimiters (e.g., "This is a test." → ["This", "is", "a", "test"]).

  • Subwords: Breaking down words into smaller elements to handle unknown or rare words (e.g., "unbelievable" → ["un", "believ", "able"]).

  • Characters: Treating each character as a token (e.g., "cat" → ["c", "a", "t"]).

Over-fragmentation of tokens can ultimately make an LLM prone to misgendering. Source: Tokenization Matters

Tools:

  • Hugging Face Tokenizers: Supports various tokenization techniques, such as WordPiece, Byte Pair Encoding (BPE), and SentencePiece.

  • spaCy: Offers prebuilt tokenizers for multiple languages.
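
As a minimal illustration, the sketch below loads a pretrained WordPiece tokenizer through the Hugging Face transformers wrapper; the bert-base-uncased vocabulary is an assumption chosen only because it is small and familiar.

```python
# A minimal tokenization sketch; bert-base-uncased (WordPiece) is an
# illustrative choice, not a recommendation for your own model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("This is a test."))      # common words stay whole
print(tokenizer.tokenize("overfragmentation"))    # rare words split into ##-prefixed subwords
print(tokenizer("This is a test.")["input_ids"])  # token IDs the model actually consumes
```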

Normalization

Normalization ensures text is converted into a consistent format, reducing variability in the dataset. Common normalization steps include:

  • Lowercasing: Standardizing text by converting all characters to lowercase (e.g., "HELLO" → "hello").

  • Removing punctuation: Eliminating unnecessary characters like commas and periods, unless needed for context (e.g., "Hi, there!" → "Hi there").

  • Expanding contractions: Replacing contractions with their full forms for clarity (e.g., "don't" → "do not").

  • Removing stop words (optional): Filtering out common words like "the" or "and" that don’t carry much meaning in most cases.

However, normalization is not always straightforward, as infrequent or specialized terms in the dataset—scientific jargon, regional slang, or non-standard language forms—are often processed with less accuracy, leading to inconsistencies in downstream tasks.

Tools:

  • NLTK: Offers robust text cleaning and normalization tools, including tokenization and stop word removal.

  • spaCy: A fast and modern library with built-in support for tokenization, part-of-speech tagging, and entity recognition, great for both normalization and advanced preprocessing tasks.

  • TextBlob: Great for handling contractions, stop word removal, and sentiment analysis with a simpler interface.
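
Here is a minimal normalization sketch, assuming NLTK's English stop-word list is available (it can be fetched with nltk.download); the contraction mapping is a toy example rather than a complete list.

```python
# A minimal text-normalization sketch: lowercasing, contraction expansion,
# punctuation removal, and optional stop-word filtering with NLTK.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)            # fetch the stop-word list on first run
STOP_WORDS = set(stopwords.words("english"))
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}  # toy mapping

def normalize(text: str, remove_stop_words: bool = False) -> str:
    text = text.lower()                           # lowercasing
    for short, full in CONTRACTIONS.items():      # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", "", text)           # strip punctuation
    tokens = text.split()
    if remove_stop_words:                         # optional stop-word removal
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

print(normalize("Hi, there! Don't WORRY."))       # -> "hi there do not worry"
```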

Dataset splitting

Splitting your dataset into training, validation, and test subsets ensures effective model evaluation. While traditional methods like Hold-Out (e.g., 80:20 training to validation ratio) are widely used, advanced techniques, such as Feature-Based Splitting, offer better balance and representation.

Feature-based dataset splitting improves balance and representation, reducing overfitting and enhancing model accuracy. Source: Automatic Optimization of Deep Learning Training through Feature-Aware-Based Dataset Splitting
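
For the traditional hold-out approach, a minimal sketch with scikit-learn might look like the following (an 80:10:10 split and a placeholder corpus are assumed); feature-aware splitting would replace the random sampling with splits balanced on dataset features.

```python
# A minimal hold-out split sketch (80% train, 10% validation, 10% test).
from sklearn.model_selection import train_test_split

documents = [f"example document {i}" for i in range(1000)]   # placeholder corpus

train_docs, rest = train_test_split(documents, test_size=0.2, random_state=42)
val_docs, test_docs = train_test_split(rest, test_size=0.5, random_state=42)

print(len(train_docs), len(val_docs), len(test_docs))        # 800 100 100
```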

Selecting the model architecture

At this stage, the team must decide between two primary approaches: building an LLM from scratch or adapting an existing pre-trained model.

Building from scratch

Designing your own LLM allows for complete customization but requires significant resources and expertise. Additionally, it’s essential to consider that in-house specialists may lean toward a model architecture they are most familiar with, which could introduce bias into the decision-making process.

Transformer architecture

Most modern Large Language Models rely on the transformer architecture, which has revolutionized natural language processing (NLP) by efficiently handling long-range dependencies in text.

Key Components:

  • Self-attention mechanism: This allows the model to weigh the importance of different words in a sentence, improving context understanding.

  • Feed-forward networks: These process the self-attention layers’ output to make predictions or generate new text.

An illustration of the main components of the transformer model from the original paper Attention Is All You Need that revolutionized the entire ML domain.  

Why transformers?

  • Parallel processing capabilities for faster training.

  • Ability to manage complex, sequential data with long-term dependencies.

  • Enhanced scalability for larger models.
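
To make the self-attention mechanism described above concrete, here is a minimal scaled dot-product attention sketch in PyTorch; the dimensions and weights are toy values, and real transformer layers add multiple heads, masking, and learned projections.

```python
# A minimal single-head scaled dot-product self-attention sketch.
import math
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    # x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # attention logits
    weights = torch.softmax(scores, dim=-1)                    # how much each token attends to the others
    return weights @ v                                         # context-weighted values

d_model, d_head = 64, 16
x = torch.randn(2, 10, d_model)                                # toy batch of two 10-token sequences
w = [torch.randn(d_model, d_head) for _ in range(3)]
print(self_attention(x, *w).shape)                             # torch.Size([2, 10, 16])
```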

Hyperparameter optimization

The number of layers, attention heads, and hidden dimensions directly affects the model’s accuracy and training time. Optimizing these parameters is essential to balancing performance and resource efficiency.

Choosing a pre-trained model

With numerous pre-trained models readily available and large research teams continuously developing new ones, fine-tuning has become the preferred approach for most organizations experimenting with Large Language Models.

  1. Available models
    Popular models like GPT-3, BERT, and T5 have been trained on massive datasets and are designed to handle a wide range of natural language processing tasks.

    • GPT-3: Excels at text generation, translation, and few-shot learning.

    • BERT: Focuses on bidirectional language understanding, making it well suited for tasks like sentiment analysis and question answering.

    • T5: Frames all NLP tasks as text-to-text problems, standing out with its flexibility.

  2. Transfer learning
    This approach enables businesses to customize a pre-trained model for their specific applications, saving time and resources while leveraging the model’s deep knowledge of language patterns.
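
As a minimal sketch of transfer learning, the example below loads a pre-trained BERT encoder from the Hugging Face Hub and attaches a fresh two-class classification head (for instance, for sentiment analysis); the model name and label count are illustrative assumptions.

```python
# A minimal transfer-learning sketch: reuse a pre-trained encoder and add a
# new task head. The head is randomly initialized until it is fine-tuned.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"                  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("The onboarding flow was painless.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)                       # torch.Size([1, 2])
```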

Model training process

Model training consists of several steps, each directly impacting the model's ability to generalize, adapt, and generate meaningful text.

Training frameworks

Due to their flexibility and robust features, frameworks like TensorFlow and PyTorch are widely used for LLM training. They also support distributed training, making it easier to manage the significant computational demands of Large Language Models.

  • PyTorch: Often preferred in research settings for its dynamic computation graphs, intuitive interface, and ease of debugging.

  • TensorFlow: Frequently chosen for production environments due to its scalability, extensive ecosystem, and support for deployment across various platforms.

Optimizers

The choice of optimizer significantly affects the model's ability to converge during training. Popular variants include:

  • Adam and AdamW: Known for their efficiency and ability to adapt learning rates for each parameter.

  • SGD: Typically used in simpler models, though less common in Large Language Models due to slower convergence.

Learning rate schedule

Adopting a learning rate schedule, which means gradually decreasing the learning rate over time, helps prevent overshooting the optimal solution.

Learning rate schedules can help optimize training for various models.  Source: Toward Optimal Learning Rate Schedule in Scene Classification Network

  • Warmup schedules: Begin with a low learning rate, gradually increase it, and then decrease it again as training progresses. This is especially useful for stabilizing early training for large-scale LLMs.
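
The sketch below combines AdamW with a linear warmup-then-decay schedule, using the scheduler helper from Hugging Face transformers; the stand-in model, learning rate, and step counts are placeholders, not tuned values.

```python
# A minimal sketch of AdamW plus a linear warmup-then-decay learning rate schedule.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)                  # stand-in for a real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)

for step in range(3):                              # inside the training loop
    loss = model(torch.randn(4, 768)).pow(2).mean()  # placeholder loss
    loss.backward()
    optimizer.step()                               # update weights
    scheduler.step()                               # then advance the learning rate
    optimizer.zero_grad()
```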

Monitoring and metrics

Constantly tracking training progress allows the team to identify overfitting or underfitting.

  • Loss and accuracy metrics: Monitor these on both training and validation sets.

  • Overfitting indicators: A significant gap between training and validation performance may indicate overfitting, requiring regularization or early stopping.

Checkpointing

Regularly saving model checkpoints prevents data loss and allows for resuming training after interruptions. This practice also enables iterative model tuning and experimenting with different configurations.
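
A minimal PyTorch checkpointing sketch is shown below, saving enough state to resume training after an interruption; the tracked fields and file path are illustrative.

```python
# A minimal checkpointing sketch: persist model and optimizer state plus the step counter.
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save(
        {"step": step,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        path,
    )

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]                             # resume training from this step
```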

Fine-Tuning

This stage allows you to customize your LLM for particular use cases. Supervised fine-tuning (SFT) is the most common method, allowing you to adapt a pre-trained model using labeled, task-specific datasets.

Key steps in fine-tuning

  1. Selecting the dataset

    • Choose labeled datasets that align with the problem you’re trying to solve, such as customer interactions, medical records analysis, or legal document summarization.

    • Ensure the data is validated for quality and relevance.

  2. Adjusting hyperparameters

    • Tune parameters like the learning rate, batch size, and regularization strength to achieve optimal performance.

    • Consider Parameter-Efficient Fine-Tuning (PEFT) techniques for more efficient resource usage, where only a part of the model’s parameters is updated.

  3. Fine-tuning the model

    • Use frameworks like TensorFlow or PyTorch to train the model on the labeled dataset.

    • The process involves updating the model’s weights through backpropagation to minimize the loss function, ensuring the model produces accurate outputs for the task.

Steps and variations. Source: Fine-tuning and Utilization Methods of Domain-specific LLMs
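
To illustrate the parameter-efficient option mentioned in step 2, here is a minimal LoRA sketch with the Hugging Face peft library; GPT-2 and its c_attn attention projection are used purely as a small, familiar example, not as a recommendation.

```python
# A minimal LoRA (PEFT) sketch: wrap a base model so only small adapter
# matrices are trained. GPT-2 and the c_attn target module are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],            # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)      # only the adapters are trainable
model.print_trainable_parameters()        # typically well under 1% of the base model
```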

LLM Alignment 

Responsible AI deployment requires aligning your LLM with user expectations and ethical considerations. Several approaches can achieve this alignment:

Reinforcement learning from human feedback (RLHF)

  1. Human feedback: Incorporate feedback from human reviewers during training to refine the model's outputs and ensure they meet user expectations.

  2. Reward signal: Use feedback to create a reward signal that guides the model's learning process and encourages it to generate more desirable responses.
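
For intuition, the reward signal is often trained with a pairwise loss over ranked responses; the sketch below shows that objective in PyTorch with placeholder scores. This is one common formulation, not the only way to build a reward model.

```python
# A minimal pairwise reward-model loss sketch: push the reward model to score
# the human-preferred ("chosen") response higher than the rejected one.
import torch
import torch.nn.functional as F

def reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

chosen = torch.tensor([2.1, 0.7])         # placeholder scores for preferred answers
rejected = torch.tensor([0.3, 1.0])       # placeholder scores for dispreferred answers
print(reward_loss(chosen, rejected))
```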

Alternative approaches

  1. Rule-based systems: Implement rule-based filters or guidelines to help the model adhere to specific ethical standards and avoid generating unsafe content.

  2. Safety nets: Create safety nets by filtering out toxic or inappropriate content in the model's outputs to ensure that it aligns with user values.

Evaluation

Evaluating an LLM's performance is critical to understanding its reliability, usability, and ability to meet its intended goals. Robust evaluation involves a mix of quantitative and qualitative approaches tailored to the model's specific use cases.

Task-specific metrics

  1. Accuracy and F1 Score: These metrics are standard for classification tasks and provide insight into how well the model distinguishes between different categories.

  2. BLEU and ROUGE Scores:

    • BLEU (Bilingual Evaluation Understudy): Measures text similarity for tasks like machine translation by comparing n-gram overlap between generated and reference outputs. It accounts for precision and includes penalties for overly short translations.

    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Commonly used for summarization tasks, it measures word/phrase overlap (ROUGE-N), the longest common subsequence (ROUGE-L), or weighted matches (ROUGE-W). These scores highlight how well the model captures key points of a reference summary while accounting for order and coherence.

    • Additional metrics, such as METEOR and TER, can provide complementary evaluations for text generation.
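
A minimal sketch of computing BLEU and ROUGE with the Hugging Face evaluate library follows (the rouge_score package is also required); the prediction and reference texts are toy examples.

```python
# A minimal BLEU/ROUGE evaluation sketch with toy texts.
import evaluate

prediction = ["the model summarizes the report accurately"]
reference = ["the model summarizes the report accurately and concisely"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=prediction, references=[reference]))   # n-gram precision with brevity penalty
print(rouge.compute(predictions=prediction, references=reference))    # ROUGE-1/2/L overlap scores
```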

Research benchmarks

Language models are often tested against standardized benchmarks to evaluate their versatility across tasks:

  • GLUE and SuperGLUE: Comprehensive sentence understanding, reasoning, and logic benchmarks.

  • MMLU: Covers 57 diverse topics, including STEM and humanities, to assess general knowledge and reasoning.

  • SQuAD: A standard for evaluating question-answering capabilities.

  • Winograd Schema Challenge: Tests pronoun resolution using common-sense reasoning.

Perplexity and cross-entropy

These metrics assess how well the model predicts the next token in a sequence. Lower perplexity and cross-entropy values indicate better predictive performance and stronger alignment with the underlying language structure. When reporting these metrics, context length should be specified, as longer contexts often improve accuracy.
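
The relationship between the two is direct: perplexity is the exponential of the average per-token cross-entropy. A minimal PyTorch sketch with random placeholder logits:

```python
# A minimal sketch relating cross-entropy to perplexity.
import torch
import torch.nn.functional as F

vocab_size = 50_000
logits = torch.randn(1, 12, vocab_size)           # (batch, seq_len, vocab) from the model
targets = torch.randint(0, vocab_size, (1, 12))   # next-token labels

cross_entropy = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
perplexity = torch.exp(cross_entropy)             # lower is better for both
print(cross_entropy.item(), perplexity.item())
```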

Robustness and fairness testing

Beyond accuracy, it’s essential to evaluate how the model performs under diverse conditions:

  • Adversarial testing: Challenges the model with deliberately misleading inputs to test its robustness.

  • Bias and fairness metrics: Measure disparities in predictions based on demographic or linguistic variations.

User-centered evaluation

  1. User studies: Gather feedback on model performance in real-world contexts. This helps identify issues like usability, coherence, or ethical concerns.

  2. A/B testing: Compare different model versions or fine-tuning approaches to determine the best performance in specific scenarios.

Stress testing for scalability

Evaluate the model’s ability to handle high loads and varied queries without degrading quality. This includes testing for latency, memory usage, and response consistency under increasing computational demands.

Deployment considerations

Infrastructure setup

Deploying on-premise infrastructure requires comprehensive data center management, addressing factors such as cooling and server reliability. In contrast, cloud deployments offer managed environments that minimize the complexity of physical infrastructure setup, allowing teams to focus on optimizing model performance.

API development

APIs bridge your model and the applications or users interacting with it. RESTful APIs are commonly used for model integration, allowing external systems to send requests and receive real-time predictions. 
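
Below is a minimal REST endpoint sketch using FastAPI and a Hugging Face text-generation pipeline; the GPT-2 stand-in model, route name, and request schema are illustrative assumptions rather than a production setup.

```python
# A minimal model-serving sketch: one POST endpoint that returns a completion.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")   # stand-in for your own LLM

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 50

@app.post("/generate")
def generate(prompt: Prompt):
    output = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"completion": output[0]["generated_text"]}

# Run with: uvicorn app:app --reload, then POST JSON {"text": "..."} to /generate
```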

Monitoring and maintenance

The team must monitor the model’s performance and resource consumption continuously. Cloud tools like AWS CloudWatch or Google Cloud Monitoring offer real-time insights to identify potential bottlenecks. On-premise solutions require similar systems for local metric visualization but provide more control over customization.

Scalability

Elastic scalability is a key benefit of cloud infrastructure, allowing easy access to additional resources during high demand. This is often automated using orchestration tools like Kubernetes. On-premise environments, while less flexible, can also scale with extra hardware. However, this requires careful planning to match future capacity needs.

Privacy and security

Protecting sensitive data and enforcing robust access control are essential to maintaining compliance with regulations and securing your LLM’s integrity. Key practices include:

  • Ensuring compliance with data protection regulations (e.g., GDPR) and anonymizing sensitive data within training datasets.

  • Using encryption to protect your model and data.

  • Restricting who can access the model and its outputs.

Ensuring responsible AI practices

Responsible AI implementation and ethical operation are essential to building trust and protecting your business from potential claims by users or regulators. This requires substantial expertise, with key practices including:

  • Bias mitigation: Identify and reduce biases in both training data and model outputs.

  • Transparency: Clearly communicate the model’s capabilities and limitations.

  • User education: Offer guidelines to help users interact with the model responsibly.

Role of domain experts in LLM training

Involving domain experts significantly improves the quality and relevance of your LLM, with their contributions including:  

  • Dataset curation: Experts help identify high-quality datasets tailored to the specific use case, ensuring the model is trained on relevant content.

  • Evaluation and feedback: Domain specialists provide insights, helping assess the model's performance and suggesting areas for improvement.

  • Ethical oversight: Experts help ensure the model aligns with industry standards.

However, gathering and managing large teams of specialists with relevant skills is a logistical challenge. Whether radiologists, lawyers, or civil engineers, experts are often in high demand and may be distributed across different regions.

For many organizations, the solution lies in crowdsourcing: leveraging a global network of professionals who can provide domain-specific expertise remotely.

Final thoughts

Building an LLM is a complex process that requires careful planning, well-organized workflows, and collaboration across many fields. It’s not just about the model itself; it also involves creating solid pipelines involving researchers, developers, and business domain experts. Given the high costs of AI projects and the growing investment in AI across industries, it may be rational to leverage expert teams that can help implement a robust overall ML strategy. 

However, nothing is impossible if you stay focused and follow the key steps outlined in this guide. By staying informed and committed to best practices, you can navigate the evolving landscape of LLM development, ensuring that your business stays ahead of the competition.

Article written by: Toloka Team

Updated: Nov 29, 2024
