Toloka documentation


crowdkit.aggregation.texts.text_summarization.TextSummarization

    tokenizer: PreTrainedTokenizer,
    model: PreTrainedModel,
    concat_token: str = ' | ',
    num_beams: int = 16,
    n_permutations: Optional[int] = None,
    permutation_aggregator: Optional[BaseTextsAggregator] = None,
    device: str = 'cpu'

Text Aggregation through Summarization

The method uses a pre-trained language model for summarization to aggregate crowdsourced texts. For each task, the workers' texts are concatenated with the concat_token separator and passed as the model's input. If n_permutations is not None, the texts are randomly shuffled n_permutations times, and the outputs are aggregated with permutation_aggregator if one is provided. If permutation_aggregator is not provided, the resulting aggregate is the most common output over the permuted inputs.
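The permutation scheme described above can be sketched without the transformer model, using a stand-in summarize callable. Note that aggregate_with_permutations and summarize are hypothetical names for illustration, not part of crowd-kit:

```python
import random
from collections import Counter

def aggregate_with_permutations(texts, summarize, concat_token=' | ', n_permutations=None):
    """Illustrative sketch of the aggregation scheme (not the crowd-kit code).

    `summarize` stands in for the seq2seq model call: it maps one
    concatenated input string to one output string.
    """
    if n_permutations is None:
        # Single pass: concatenate texts in their given order.
        return summarize(concat_token.join(texts))
    texts = list(texts)
    outputs = []
    for _ in range(n_permutations):
        # Shuffle the worker texts and summarize each permuted input.
        random.shuffle(texts)
        outputs.append(summarize(concat_token.join(texts)))
    # Without a permutation_aggregator, take the most common output.
    return Counter(outputs).most_common(1)[0][0]
```

In the real class, summarize corresponds to a beam-search generation call on the pre-trained model, and the most-common-output fallback is replaced by permutation_aggregator when one is supplied.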

To use a pre-trained model and tokenizer from transformers, you need to install torch.

M. Orzhenovskii, "Fine-Tuning Pre-Trained Language Model for Crowdsourced Texts Aggregation," Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, 2021, pp. 8-14.

S. Pletenev, "Noisy Text Sequences Aggregation as a Summarization Subtask," Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, 2021, pp. 15-20.

Parameters Description

Parameters Type Description
tokenizer PreTrainedTokenizer

Pre-trained tokenizer.

model PreTrainedModel

Pre-trained model for text summarization.

concat_token str

Token used for the workers' texts concatenation.

Default value: ' | '.

num_beams int

Number of beams for beam search. 1 means no beam search.

Default value: 16.

n_permutations Optional[int]

Number of input permutations to use. If None, use a single permutation according to the input's order.

Default value: None.

permutation_aggregator Optional[BaseTextsAggregator]

Text aggregation method used to aggregate the outputs of multiple input permutations when n_permutations is specified.

Default value: None.

device str

Device to use such as cpu or cuda.

Default value: cpu.

texts_ Series

Tasks' texts. A pandas.Series indexed by task such that result.loc[task, text] is the task's text.


import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from crowdkit.aggregation import TextSummarization

device = 'cuda' if torch.cuda.is_available() else 'cpu'
mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)
agg = TextSummarization(tokenizer, model, device=device)
result = agg.fit_predict(df)
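In the snippet above, fit_predict takes a DataFrame of worker responses. A minimal sketch of the assumed input layout follows; the task, worker, and text column names reflect crowd-kit's convention for text aggregation, and the values are made up for illustration:

```python
import pandas as pd

# Hypothetical responses: one row per worker answer for each task.
# Column names assume crowd-kit's text-aggregation input convention.
df = pd.DataFrame({
    'task': ['t1', 't1', 't1'],
    'worker': ['w1', 'w2', 'w3'],
    'text': ['hello world', 'helo world', 'hello, world'],
})
```

All responses for a task are concatenated with concat_token into a single model input, so every row sharing a task value contributes to that task's aggregate.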

Methods Summary

Method Description
fit_predict Run the aggregation and return the aggregated texts.