TextSummarization

crowdkit.learning.text_summarization.TextSummarization

TextSummarization(
self,
tokenizer: PreTrainedTokenizer,
model: PreTrainedModel,
concat_token: str = ' | ',
num_beams: int = 16,
n_permutations: Optional[int] = None,
permutation_aggregator: Optional[BaseTextsAggregator] = None,
device: str = 'cpu'
)

Text Aggregation through Summarization

The method uses a pre-trained summarization language model to aggregate crowdsourced texts. For each task, the workers' texts are concatenated with the | token and passed as the model's input. If n_permutations is not None, the texts are randomly shuffled n_permutations times, and the resulting outputs are aggregated with permutation_aggregator if it is provided. If permutation_aggregator is not provided, the resulting aggregate is the most common output over the permuted inputs.
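The default permutation-aggregation rule (most common output when no permutation_aggregator is given) can be sketched as follows. This is an illustrative toy, not the library's implementation; aggregate_permutations and the lambda "model" are hypothetical names introduced here:

```python
import random
from collections import Counter

def aggregate_permutations(texts, summarize, n_permutations, seed=0):
    """Shuffle the texts n_permutations times, summarize each concatenation,
    and return the most common output. `summarize` is any callable mapping
    a concatenated string to a summary string."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_permutations):
        shuffled = texts[:]
        rng.shuffle(shuffled)
        # Concatenate with the ' | ' token, mirroring concat_token's default.
        outputs.append(summarize(' | '.join(shuffled)))
    # Most common output across permutations wins.
    return Counter(outputs).most_common(1)[0][0]

# Toy "model": returns the shortest worker text as the summary.
result = aggregate_permutations(
    ['the cat sat', 'cat sat', 'a cat sat down'],
    summarize=lambda s: min(s.split(' | '), key=len),
    n_permutations=5,
)
# result is 'cat sat' for every permutation, so it is the aggregate.
```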

To use a pre-trained model and tokenizer from transformers, you need to install torch.
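The dependencies can be installed with pip; a minimal setup, assuming the PyPI package names torch, transformers, and crowd-kit:

```shell
pip install torch transformers crowd-kit
```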

M. Orzhenovskii, "Fine-Tuning Pre-Trained Language Model for Crowdsourced Texts Aggregation," Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, 2021, pp. 8-14. https://ceur-ws.org/Vol-2932/short1.pdf

S. Pletenev, "Noisy Text Sequences Aggregation as a Summarization Subtask," Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, 2021, pp. 15-20. https://ceur-ws.org/Vol-2932/short2.pdf

Parameters description

| Parameters | Type | Description |
| --- | --- | --- |
| `tokenizer` | PreTrainedTokenizer | Pre-trained tokenizer. |
| `model` | PreTrainedModel | Pre-trained model for text summarization. |
| `concat_token` | str | Token used to concatenate the workers' texts. Default value: `' \| '`. |
| `num_beams` | int | Number of beams for beam search. 1 means no beam search. Default value: 16. |
| `n_permutations` | Optional[int] | Number of input permutations to use. If None, use a single permutation according to the input's order. Default value: None. |
| `permutation_aggregator` | Optional[BaseTextsAggregator] | Text aggregation method used to aggregate the outputs of multiple input permutations if n_permutations is not None. Default value: None. |
| `device` | str | Device to use, such as cpu or cuda. Default value: cpu. |
| `texts_` | Series | Tasks' texts. A pandas.Series indexed by task such that `texts_.loc[task]` is the task's text. |

Examples:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from crowdkit.learning import TextSummarization

device = 'cuda' if torch.cuda.is_available() else 'cpu'
mname = 'toloka/t5-large-for-text-aggregation'
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)
agg = TextSummarization(tokenizer, model, device=device)
# df is a pandas.DataFrame of workers' responses with
# 'task', 'worker', and 'text' columns.
result = agg.fit_predict(df)
```

Methods summary

MethodDescription
fit_predictRun the aggregation and return the aggregated texts.

Last updated: March 31, 2023
