Uploading dataset

An effective model produces high-quality labels. To find one, you need a dataset with the texts you want to label. Ideally the dataset is already labeled, but you can also label it in Toloka LLM.

This dataset serves as a reference for measuring the quality of the labels that a combination of a model and a prompt produces. You will use it to iterate through variants until you find one of higher quality than the one you already have. The platform provides a good default, but you may find a variant that performs even better.

Dataset file sample

Upload and label

To upload and label your dataset file:

  1. Click Upload dataset.

  2. Choose and open the CSV or JSON file that contains your dataset.

    The dataset rows will be displayed in the Match data columns to input texts and labels area.

  3. Visually verify that it contains the data you want to use, then specify two columns:

    • Input (required): contains the texts to label and is displayed under Text to label.

    • Label (optional): contains a class for each text and is displayed under Label (ground truth).


    If your dataset file doesn't contain labels, they will not be displayed (the Label (ground truth) drop-down menu will only contain the None option). In this case, you can add the classes together with their descriptions later and label your dataset using them.

    If your dataset file contains labels, check that each Text to label value matches its Label (ground truth) value. If it doesn't, correct the label using the drop-down menu.

When done, click Save dataset.
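
As an illustration of the kind of file the steps above expect, here is a minimal sketch that writes the same small dataset as both CSV and JSON. The column names `text` and `label`, the example rows, the file names, and the JSON layout (an array of objects) are all assumptions made for this example; Toloka LLM lets you map your own columns to Input and Label during upload.

```python
import csv
import json

# Hypothetical example rows. The column names "text" and "label" are
# assumptions for this sketch, not names required by Toloka LLM.
ROWS = [
    {"text": "The delivery arrived two days late.", "label": "negative"},
    {"text": "Great support, my issue was solved quickly.", "label": "positive"},
    {"text": "The package contains three items.", "label": "neutral"},
]


def write_csv(path, rows):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)


def write_json(path, rows):
    """Write the same rows as a JSON array of objects (assumed layout)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


write_csv("dataset.csv", ROWS)
write_json("dataset.json", ROWS)
```

For an unlabeled dataset, the same sketch applies with the `label` column left out.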

Toloka LLM will calculate all quality metrics by comparing the model's output to the labels you provide.
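
To make the idea of comparing model output to ground-truth labels concrete, here is a minimal sketch that computes accuracy and counts each (ground truth, predicted) pair. Accuracy is used only as an illustration; the function names and sample labels are hypothetical, and the sketch does not claim to reproduce the metrics Toloka LLM actually reports.

```python
from collections import Counter


def accuracy(ground_truth, predicted):
    """Return the share of texts whose predicted label equals the ground truth."""
    if len(ground_truth) != len(predicted):
        raise ValueError("Both label lists must have the same length.")
    if not ground_truth:
        raise ValueError("At least one labeled text is required.")
    matches = sum(g == p for g, p in zip(ground_truth, predicted))
    return matches / len(ground_truth)


def confusion_counts(ground_truth, predicted):
    """Count how often each (ground truth, predicted) pair occurs."""
    return Counter(zip(ground_truth, predicted))


# Hypothetical labels: the ones you provide and the ones a model returns.
truth = ["positive", "negative", "neutral", "negative"]
model_output = ["positive", "negative", "negative", "negative"]

print(accuracy(truth, model_output))  # 3 of 4 labels match
print(confusion_counts(truth, model_output))
```

The pair counts show where a variant goes wrong, for example a "neutral" text labeled as "negative", which is the kind of signal that helps when comparing variants.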

Trying unlabeled data

You can use Toloka LLM without labeled data by providing only the texts to label. In that case, you can run a model and see its output, and you can deploy a variant without quality measurements. Note that the tool provides the most value when a labeled dataset serves as the ground truth for quality measurement.

To run the resulting model without labels, add a prompt and click Deploy.


Last updated: September 5, 2023
