Viacheslav Zhukov
Text Classification on Extra Small Datasets: Fine-tuning vs ChatGPT
We at the Toloka ML team continue researching and comparing approaches to the text classification problem under different conditions, and here we present another experiment, this time focusing on the performance of different NLP models applied to extra-small datasets. In our previous articles, we provided a brief overview of potential solutions and compared classical models with large language models (LLMs) for a specific task. However, those comparisons were based on a "regular" dataset containing a sufficient number of data points to build a reliable classifier. In real-world scenarios, one may encounter situations where limited data is available or human labeling has not yet been performed. Intuitively, LLMs such as GPT-3 or ChatGPT might outperform smaller models due to their extensive "knowledge".
To investigate this hypothesis, we will create an artificially small dataset by extracting a portion of a larger one and compare several approaches. We will fine-tune the RoBERTa base model, employ ChatGPT for few-shot classification, and also fine-tune the GPT-3 Babbage model.
The Dataset
To evaluate the comprehension capabilities of various models, we selected a multiclass dataset comprising scientific article abstracts, with the task of determining each article's domain. We opted for the WOS-11967 [1] dataset, which contains 11,967 documents organized into 35 categories under 7 parent categories: medical domain, psychology, computer science, biochemistry, electrical engineering, civil sciences, and mechanical engineering. We sampled 10,000 data points and focused solely on the parent categories for our analysis.
While the dataset is not perfectly balanced, the class distribution is reasonably proportional, allowing for the possibility of achieving satisfactory results across all classes. The class distribution is illustrated in the image below.
The class distribution of the sample of the WOS-11967 dataset
Upon manual analysis, we observed that determining the domain of some abstracts is relatively straightforward, while in other cases, the task becomes more challenging. For instance, computer science articles may discuss mathematical topics, or psychology articles might employ numerous medical and biochemical terms and abbreviations, making it difficult to distinguish them from biochemistry or medical domains. The abstracts also vary significantly in length, with a mean of 274 tokens (ChatGPT tokens) and a standard deviation of 115 tokens.
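As a side note, these length statistics are easy to reproduce by tokenizing the abstracts. The article doesn't say how the counts were obtained, so the snippet below is a minimal sketch under our own assumptions: it uses the tiktoken library with the cl100k_base encoding (the one used by gpt-3.5-turbo) and placeholder abstracts.

```python
import numpy as np
import tiktoken

# Placeholder abstracts; in practice, the 10,000 sampled WOS abstracts would be used.
abstracts = [
    "We propose a convolutional architecture for image-based defect detection ...",
    "Patients with chronic kidney disease were assessed over a 12-month period ...",
]

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo (assumption)

lengths = [len(enc.encode(text)) for text in abstracts]
print(f"mean: {np.mean(lengths):.0f} tokens, std: {np.std(lengths):.0f} tokens")
```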
To simulate scenarios involving extra-small datasets, we performed a train-test split on the corpora, allocating a small number of samples to the training set. We repeated this process three times with different training set sizes to evaluate the performance changes in the models based on the available training data. We created three splits for our experiment: WOS-11967-s200 (200 samples in the training set, 9,800 samples in the test set), WOS-11967-s500 (500 / 9,500), and WOS-11967-s2000 (2,000 / 8,000).
Now, let's examine the results obtained using different models to tackle these problems.
Regular fine-tuning with RoBERTa
For our baseline, we selected the RoBERTa base model [2] and fine-tuned it on the three datasets described above. We used the same hyperparameter configuration for each run (batch size of 32, learning rate of 3e-5, a linear scheduler with warmup, and a 256-token window), along with early stopping to prevent overfitting. A condensed sketch of this configuration is shown below, and the results we obtained follow it.
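This reconstruction with the Hugging Face Trainer is not the authors' exact training code: the warmup share, the epoch budget, and the validation carve-out used for early stopping are our assumptions, and `train_df` stands for the training portion of one of the splits above.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

LABELS = sorted(train_df["parent_label"].unique())
label2id = {name: i for i, name in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=len(LABELS))

def preprocess(batch):
    enc = tokenizer(batch["abstract"], truncation=True, max_length=256)  # 256-token window
    enc["labels"] = [label2id[label] for label in batch["parent_label"]]
    return enc

dataset = Dataset.from_pandas(train_df).map(preprocess, batched=True)
dataset = dataset.train_test_split(test_size=0.1, seed=42)  # small validation carve-out for early stopping

args = TrainingArguments(
    output_dir="roberta-wos",
    per_device_train_batch_size=32,
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,               # assumed warmup share
    num_train_epochs=20,            # early stopping typically halts much earlier
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```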
We can clearly see that 200 samples are insufficient for extracting all the necessary patterns and information to accurately classify the abstracts. The lower macro-average F1 score also indicates that the model underperforms on under-represented classes like Mechanical Engineering, suggesting that having only a few samples from a particular class is inadequate. As anticipated, the model's performance improves as the amount of available data increases, ultimately resulting in a fairly robust performance for multiclass classification across seven classes.
Few-shot with ChatGPT
The second approach we explored was few-shot classification using ChatGPT. This method differs significantly from traditional classification, as it doesn't involve training a model per se; instead, we engineer the input prompt to achieve optimal performance. However, it's impossible to feed all 200 samples into the model due to its 4096-token context size limit. So, given the measurements above, we could only present around 14 abstracts to the model, and that number is further reduced when considering the tokens used for instructions and delimiters.
Initially, we employed the system role for instructions and provided a single example per class to guide the model's response. We simplified the class names to single tokens while retaining their meaning, making it easier for the model to select the appropriate category and limiting the output to a single token. For instance, "Biochemistry" became "Bio," and "Computer Science" became "Computer." Additionally, we restricted token generation by providing a list of classes to choose from and instructing the model to return the "Unknown" token if it is unsure about the category.
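Put together, the setup looks roughly like the sketch below. It assumes the openai 0.x ChatCompletion interface that was current at the time; the prompt wording, the shortened names for classes other than "Bio" and "Computer," and the example abstracts are illustrative rather than the exact ones used in the experiment.

```python
import openai

CLASSES = ["Medical", "Psychology", "Computer", "Bio", "Electrical", "Civil", "Mechanical"]

SYSTEM_PROMPT = (
    "You classify scientific article abstracts into exactly one of these domains: "
    + ", ".join(CLASSES)
    + ". Reply with a single word from the list. If you are not sure, reply Unknown."
)

# One (abstract, label) pair per class, taken from the training split (placeholders here).
FEW_SHOT_EXAMPLES = [
    ("We propose a graph neural network for static program analysis ...", "Computer"),
    ("Enzyme kinetics of the mutated protein were measured in vitro ...", "Bio"),
    # ... one example for each remaining class
]

def classify(abstract: str) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for text, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": abstract})
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
        max_tokens=3,   # the answer should be a single short word
    )
    return response["choices"][0]["message"]["content"].strip()
```

Encoding the per-class examples as alternating user/assistant turns keeps the instruction itself short and makes the expected answer format unambiguous.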
Unfortunately, the performance achieved with this method was inferior compared to the RoBERTa model trained on just 200 samples. We also noticed that the model's classification ability heavily depends on the supplied prompt. Modifying a single sentence could either improve or worsen the metrics. In some cases, ChatGPT missed categories despite explicit instructions not to do so (which could be a drawback of our prompt formulation). In a few edge cases, it produced categories not listed in the instruction but actually described the articles' domains, such as "Math" or "Chemistry." It's unclear whether these flaws should be attributed to the model or the dataset, but according to the validation set, these categories can be effectively corrected using simple rules, like changing all instances of "Math" to "Computer."
In pursuit of improved metrics, we attempted to utilize as much data as possible (since we still couldn't feed all 200 samples into the model). We devised a two-stage process: first, we asked the model to identify similarities between abstracts from a specific domain and generate summaries; second, we incorporated these summaries into the instruction to provide the model with insights about the classes and the features identified by the model itself in the first stage. Essentially, this approach allowed us to feed more training samples into the model, and it worked: we boosted the metrics by approximately 10%. The following is the prompt we used to generate these summaries:
The prompt for ChatGPT used to extract meaningful information about article domains
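In code, this first (summarization) stage could look roughly like the sketch below; the prompt text here is a paraphrase rather than the exact prompt from the figure, and the same openai 0.x interface as above is assumed.

```python
import openai

def summarize_domain(domain: str, domain_abstracts: list) -> str:
    """Ask ChatGPT what a handful of abstracts from one domain have in common."""
    joined = "\n---\n".join(domain_abstracts)   # 7-8 training abstracts per domain
    prompt = (
        f"Below are abstracts of scientific articles from the {domain} domain, "
        "separated by '---'. Briefly describe what they have in common: typical topics, "
        "terminology, and other features that distinguish this domain.\n\n" + joined
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

# The per-domain summaries are then pasted into the classification prompt
# (second stage) as short descriptions of each class.
```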
For each domain, we supplied approximately 7-8 abstracts, so around 63 distinct abstracts were used in total to prepare the classification prompt (8 abstracts per class across 7 classes to build the summaries, plus 7 abstracts provided as examples in the actual prompt). As before, we instructed the model to respond with "Unknown" when uncertain about the class, and observed on the validation set that the majority of "Unknown" responses corresponded to Computer Science articles. We subsequently replaced all "Unknown" instances with the "Computer" class.
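These rule-based corrections (together with the "Math" rule mentioned for the earlier attempt) boil down to a small mapping applied to the raw model outputs. In the sketch below, the "Unknown" and "Math" entries come from the article; the rest is illustrative.

```python
# Map observed out-of-list answers back to target classes.
CORRECTIONS = {
    "Unknown": "Computer",   # most "Unknown" answers were Computer Science articles
    "Math": "Computer",
    "Chemistry": "Bio",      # illustrative entry
}

def normalize_prediction(raw_answer: str) -> str:
    answer = raw_answer.strip().rstrip(".")
    return CORRECTIONS.get(answer, answer)
```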
The resulting classification prompt looked as follows:
The final prompt for ChatGPT used to classify article abstracts
Once again, performance was heavily influenced by the prompt and the samples provided as examples. The model also generated several categories outside the target list, requiring manual adjustments based on the validation set. This approach yielded the following results:
The performance was notably better than fine-tuning a RoBERTa model on 200 samples, and it also required fewer samples. However, as the availability of labeled data increased, RoBERTa began to outperform this approach, even with just 500 samples. We believe that further performance improvements are possible through proper prompt engineering. Some useful tips and tricks can be found in resources such as the Prompting Guide.
Fine-tuning a GPT-3 model
For our final approach, we fine-tuned the GPT-3 Babbage model on the three datasets. We adhered to the dataset preparation recommendations outlined in the OpenAI guide and opted for the default hyperparameters without making any specific adjustments; a sketch of the preparation step is shown below. The training process for each dataset took approximately 20 minutes, and the results we obtained are discussed next.
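This is roughly what that preparation looks like with the legacy (mid-2023) fine-tuning flow: the training split is written out as prompt/completion JSONL and the job is started from the CLI. The separator and the leading space in the completion follow the general recommendations of the OpenAI guide; the exact formatting used in the experiment is not stated.

```python
import json

# `train_df` is the training portion of one of the splits above.
with open("wos_train.jsonl", "w") as f:
    for _, row in train_df.iterrows():
        f.write(json.dumps({
            "prompt": row["abstract"] + "\n\n###\n\n",   # separator recommended by the guide
            "completion": " " + row["parent_label"],     # leading space, per the guide
        }) + "\n")

# The legacy CLI then validates the file and starts the fine-tune, roughly:
#   openai tools fine_tunes.prepare_data -f wos_train.jsonl
#   openai api fine_tunes.create -t wos_train_prepared.jsonl -m babbage
```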
The fine-tuned GPT-3 model delivered impressive results even on the smallest dataset, surpassing both RoBERTa and ChatGPT. As the amount of training data increased, the performance gap between RoBERTa and the tuned GPT-3 model narrowed, raising questions about the resources and feasibility of using either option. We discussed the advantages and disadvantages of both approaches in our previous articles.
The key takeaway is that you shouldn't approach this problem with only performance in mind. There are other factors to consider, such as latency, budget, and availability, along with legal and privacy concerns. OpenAI models are available via an API and bill you on a per-token basis, while your own small model, or even your own LLM, can be deployed directly on a virtual machine or a Kubernetes cluster, where you pay only for the hardware time. The latter approach requires the necessary engineering skills, though, and might not fit every business.
Conclusions
The experiment demonstrates that our initial intuition is reflected in reality – larger models trained on more extensive data perform significantly better on extra-small datasets. With proper prompt engineering and few-shot techniques, it is possible to achieve favorable results. However, this performance difference diminishes as the dataset size increases. As illustrated in our article, "Best Architecture for Your Text Classification Task: Benchmarking Your Options," a domain-adapted RoBERTa model can sometimes outperform generic large language models (LLMs).
References
[1] Kowsari K, Brown DE, Heidarysafa M, Jafari Meimandi K, Gerber MS, Barnes LE. HDLTex: Hierarchical Deep Learning for Text Classification. In: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE; 2017.
[2] Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR. 2019;abs/1907.11692. http://arxiv.org/abs/1907.11692
Updated:
Jul 20, 2023