Viacheslav Zhukov

Mar 7, 2023

Insights

Choosing the best architecture for your text classification task

A review of models based on real-world applications

Modern large language models (LLMs) have demonstrated the best performance for many NLP tasks from text classification to text generation. But are they really a “silver bullet” or a one-stop-shop solution? Can they be applied across the board? Toloka's ML team faces these kinds of tasks all the time, and our answer so far is a resounding “No.” Performance is not the only factor you should be concerned about when developing a model for a real use case. And you probably don’t want to spend your entire department’s budget on it either.

We've created a practical guide on ways to solve text classification problems – depending on how much data you have, the type of data (short or long texts, common or specific topics, etc.), time and budget constraints, computational and security requirements, and other factors.

Approaches to text classification

Let’s start off with a brief overview of potential models and solutions you could use.

Old-school tf-idf models

Models in this category are built on basic statistics like word counts and co-occurrences. The resulting (usually dimensionality-reduced) feature space is passed to one of the classic ML models like SVM, MLP, or Naive Bayes. This method is easy to implement and does not require any specific libraries or accelerators — you're good to go with one of the classic solutions like sklearn or NLTK. Moreover, these models can easily handle both short and long texts. Given their relatively small size, they're highly efficient when it comes to training, deployment, and inference.
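
For illustration, here is a minimal tf-idf baseline built with scikit-learn. The texts, labels, and classifier choice are placeholders; this is a sketch of the approach, not a tuned solution:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder data: replace with your own texts and labels
texts = ["the invoice is attached", "win a free prize now", "meeting moved to friday"]
labels = ["work", "spam", "work"]

# tf-idf features fed into a linear SVM classifier
clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])
clf.fit(texts, labels)

print(clf.predict(["please review the attached report"]))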

Nevertheless, this approach has several drawbacks, the most important one being performance. Compared to other approaches, tf-idf models rank the lowest. Additionally, you’ll have to carry out extensive preprocessing of your texts (misspellings, stopwords, punctuation, lemmatization, and more). You also need a lot of data to produce a robust model — and your data should resemble academic texts, not social media, which can contain numerous misspellings as well as slang.

First embeddings and pre-trains

Word embeddings are excellent for text classification since each word (or sequence of characters) is represented by a vector of numbers containing useful information about the context, use, and semantics.

Take, for example, Word2Vec, its variations, and their implementation in the fastText library, which ships as a single binary you can run to get the desired result. You still need a large dataset to produce solid word embeddings and to train the classifier head, but with proper configuration you can significantly reduce your preprocessing effort. Because the library is self-contained, it can run right from the system console.

Training time increases with model size (you have to learn and store an embedding for every token plus the classification head's weights), but it stays at an acceptable level. On average, it takes anywhere from 10 minutes to an hour to create a fastText model between 300 MB and 2 GB in size. The model can handle texts of any length, and inference is incredibly fast because all it does is build a text embedding and pass it through an MLP. The availability of pre-trained word embeddings for a variety of languages makes fastText a baseline for almost any text classification problem.
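
Here is a minimal sketch of supervised fastText training. The file paths and hyperparameters are illustrative; the training file uses fastText's __label__ prefix convention:

import fasttext

# train.txt holds one labeled example per line, e.g.:
# __label__work the invoice is attached
# __label__spam win a free prize now
model = fasttext.train_supervised(
    input="train.txt",   # hypothetical path to your labeled data
    epoch=10,
    lr=0.5,
    wordNgrams=2,
)

# Predict the top label (and its probability) for a new text
labels, probs = model.predict("please review the attached report")
print(labels, probs)

# Save the model for later inference
model.save_model("classifier.bin")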

Small transformers

This category includes transformer-based language models such as BERT and RoBERTa, currently considered state of the art in NLP. The line between small and large transformers isn't clear-cut, even if it seems obvious that a model with 110 million parameters is "small" and a model with 175 billion parameters is "large". Either way, several key advantages make transformers a great option: they're resistant to misspellings and usually require little preprocessing compared to other models.

Since you probably won't be training your own BERT and will likely use a pre-trained model from a library or hub (like Hugging Face), you can build decent models from comparatively small datasets. If your task is common and your domain is similar to one that already has a tuned version, you may only need a few hundred or a few thousand samples to slightly tune the model and achieve great results. The model size usually ranges from 600 MB to several gigabytes. It's also a good idea to have access to GPUs, because training may take some time to complete.

However, there are also some disadvantages to consider. The resulting model is much slower than Word2Vec, so if you need real-time inference you'll either have to use a GPU device or invest in model optimization (graph optimizations, ONNX, DeepSpeed, and others). Additionally, the length of text a model can handle is limited by its architecture, usually to about 512 tokens (roughly 380 words), and it's up to you to decide which part of your text those tokens should come from. In practice, a simple approach like taking the first 192 tokens works well.
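
As a rough sketch, fine-tuning a small pre-trained transformer with the Hugging Face transformers and datasets libraries might look like the following. The checkpoint, dataset, and hyperparameters are placeholders, and texts are truncated to the first 192 tokens as suggested above:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"   # any small pre-trained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Illustrative dataset with "text" and "label" columns
dataset = load_dataset("imdb")

def tokenize(batch):
    # Keep only the first 192 tokens of each text
    return tokenizer(batch["text"], truncation=True, max_length=192)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="clf-checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,   # enables dynamic padding of batches
)
trainer.train()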

LLMs

It's likely that you don't have your own LLM — they're really big! The large downloadable version of T5, for example, is about 40 GB. You'll have to deploy such a model somehow, and inference may take time, so you'll either need an expensive computational cluster or a service that provides an API, like OpenAI with its GPT-3 model.

One benefit is that LLMs require little data for tuning, and you don't need to worry about preprocessing. As a side note, zero-shot and few-shot approaches don't work well for text classification problems; you'll need to either fine-tune or p-tune (prompt-tune) your model. If you choose to use an API, you'll also need to consider internet access, data security, SLAs, pricing, and more. The biggest plus, however, is the performance LLMs can achieve.
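
For reference, fine-tuning GPT-3 through the OpenAI API at the time of writing involved uploading training data as JSONL prompt/completion pairs. The sketch below only prepares such a file; the file name, prompt template, and labels are hypothetical:

import json

# Hypothetical labeled examples
examples = [
    {"text": "the invoice is attached", "label": "work"},
    {"text": "win a free prize now", "label": "spam"},
]

# One prompt/completion pair per line, in the format the GPT-3
# fine-tuning endpoint expected at the time of writing
with open("finetune.jsonl", "w") as f:
    for ex in examples:
        record = {
            "prompt": f"Classify the text: {ex['text']}\n\nLabel:",
            "completion": f" {ex['label']}",
        }
        f.write(json.dumps(record) + "\n")

# The file is then uploaded and a tuning job is started via the OpenAI
# CLI or API (e.g. `openai api fine_tunes.create -t finetune.jsonl`).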

Choosing by scenario

As expected, all these approaches have their own pros and cons. Consider a variety of architectures to find a good fit for your specific real-world task. We recommend basing your decision on the actual requirements you have for your text classification problem.

Let’s go over some of the most common cases we’ve encountered and our recommended approaches. Your text classification task will likely fall under one of these scenarios.

Your goal is to create a high-performing model

If performance really matters, choose a transformer. Spend some time searching for the optimal architecture and pre-trained weights, expanding your dataset, optimizing your pipeline and parameters, and so on. Also, try tuning an LLM, either your own or via an API. Just know that it will take time. You need to have expertise in ML and/or NLP to achieve the best results in this case.

You have little data available

Go for LLMs or a small tuned transformer. If your task is general enough, you can leverage extensive model catalogs that are available across various hubs.

You have a lot of data available

Start with fastText and establish a baseline. This may be enough if your performance requirements aren't that strict. If they are, fine-tune one of the small pre-trained transformers. If an API is an option and you have a dedicated budget, you can try tuning an LLM too.

You have privacy or security concerns about your data

If you have privacy concerns, you don't want your data to leave a defined security perimeter or be logged by a third-party service. An API is not an option until you have clarified your logging and security requirements with the provider. Choose local models that you can deploy yourself according to your hardware and software setup. Also keep in mind that where data and models are located matters under modern privacy legislation.

You have a common task and domain

Someone has probably already solved the task for you and you can apply their solutions. Simply look for applicable tuned transformers. LLMs will likely work too, but we’ve noticed that previously tuned transformers can outperform LLMs if the dataset is extremely small (a couple hundred samples). However, the difference is minute.

Your task or data is very specific

In this case, LLMs have the best performance compared to other approaches. Training an adequate small transformer is a challenge under these circumstances, and other architectures usually perform much worse.

Your model will be used for online inference (under hundreds of milliseconds)

Try fastText because of its speed. If you're not satisfied with the quality, try a small transformer, but you'll most likely have to use an optimization mechanism or deploy your model with access to a GPU. There are lots of ways to speed up inference with BERT-like models — even a brief overview would take an entire article. Some of them can be implemented with a single parameter change or three lines of code; others require rewriting the whole model and training pipeline. LLMs are usually not an option here unless they've been optimized.
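
As one example of a relatively low-effort optimization, a fine-tuned transformer can be exported to ONNX and served with onnxruntime. This is a rough sketch: the checkpoint path is hypothetical, and it assumes a standard model that takes input_ids and attention_mask:

import torch
import onnxruntime
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "clf-checkpoints"   # hypothetical path to a fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

# Export with dynamic batch and sequence dimensions
dummy = tokenizer("example text", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "classifier.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
)

# Run inference on CPU with onnxruntime
session = onnxruntime.InferenceSession("classifier.onnx")
inputs = tokenizer("please review the attached report", return_tensors="np")
logits = session.run(
    ["logits"],
    {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
)[0]
print(logits.argmax(axis=-1))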

Your model will be used for batch processing only

Opt for a large model (or a small transformer without any optimizations). While it seems straightforward at first glance, you still need an understanding of timing so that your batches don’t stack up.

You’re concerned about scalability

For example, your model will be widely used, and you expect high RPS on its endpoint. You’ll probably apply an orchestration mechanism like Kubernetes, assuming that your pods can be deployed and destroyed quickly, in which case there may be restrictions on model size (namely, image size). Therefore, fastText and small transformers are common options.

You have access to a computational cluster with modern GPUs

If so, you're lucky! You can play with different types of transformers, even large ones, but the real question is, can you use a node pool with GPUs for inference? If the answer is yes, you can choose whatever you want, even LLMs. If not, you’ll probably find yourself optimizing a small transformer.

You have no access to modern hardware accelerators

That's unfortunate to hear! Start with basic approaches and train a fastText model. You can also train a transformer in this setup, but it will require a deeper understanding of optimization mechanisms. Another option is to move from general-purpose libraries to something more specialized like FasterTransformer.

You have a lot of time to build a model

Try any architecture, experiment with different pre-trained weights and hyperparameters, and keep an eye on your loss curves.

You have almost no time to build a model

In this case, fastText and API-accessible LLMs are good options. If your task is popular, you can tune an appropriate small transformer with a default set of hyperparameters. Still, API-accessible LLMs are usually the best choice.

You need to create N models per day with good performance, and preferably automate the process

This scenario imposes size and computational limits on your models. Small or optimized models are a good fit. It won’t be reasonable to tune an LLM on a regular basis, or to store and deploy hundreds of copies of it. Another good option could be fastText, if it achieves sufficient performance.

A real-life story

So, what can go wrong if you choose an inappropriate architecture, apart from insufficient performance?

In one of our cases, we had two key requirements for the classification model: good performance and the ability to handle batches of 100 to 1,000 texts arriving every ten minutes. We trained a small transformer, deployed it as an online endpoint on a Kubernetes cluster, and everything worked just fine… until an extremely large batch arrived.

It caused a system-wide hang of more than an hour, resulting in timeout exceptions for other batches. It turned out we had been working from the wrong request statistics, and we had to properly configure an autoscaling mechanism on the cluster and invest more time and effort in model optimization. This is just one example of the risks you can mitigate with careful planning.

What it all comes down to

In the end, there’s no “silver bullet” or “one-size-fits-all” approach. As a key takeaway, try to avoid looking at your text classification challenge with only performance in mind. There are other factors that are worth considering, like inference and training time, budget, scalability, privacy, and data type.

Connect with us

Feel free to reach out via Slack if you’d like to talk about these solutions and more. We’re always happy to hear from you!
