crowdkit.learning.text_summarization.TextSummarization
TextSummarization(self, tokenizer: PreTrainedTokenizer, model: PreTrainedModel, concat_token: str = ' | ', num_beams: int = 16, n_permutations: Optional[int] = None, permutation_aggregator: Optional[BaseTextsAggregator] = None, device: str = 'cpu')
Text Aggregation through Summarization
The method uses a pre-trained language model for text summarization to aggregate crowdsourced texts. For each task, the workers' texts are concatenated with the `|` token and passed as the model's input. If `n_permutations` is not `None`, the texts are randomly shuffled `n_permutations` times, and the outputs are aggregated with `permutation_aggregator` if it is provided. If `permutation_aggregator` is not provided, the resulting aggregate is the most common output over the permuted inputs.
To use a pretrained model and tokenizer from `transformers`, you need to install `torch`.
M. Orzhenovskii, "Fine-Tuning Pre-Trained Language Model for Crowdsourced Texts Aggregation," Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, 2021, pp. 8-14. https://ceur-ws.org/Vol-2932/short1.pdf
S. Pletenev, "Noisy Text Sequences Aggregation as a Summarization Subtask," Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, 2021, pp. 15-20. https://ceur-ws.org/Vol-2932/short2.pdf
Parameters | Type | Description |
---|---|---|
tokenizer | PreTrainedTokenizer | Pre-trained tokenizer. |
model | PreTrainedModel | Pre-trained model for text summarization. |
concat_token | str | Token used for the workers' texts concatenation. Default value: ' \| '. |
num_beams | int | Number of beams for beam search. 1 means no beam search. Default value: 16. |
n_permutations | Optional[int] | Number of input permutations to use. If None, the texts are passed in their original order without permutation. Default value: None. |
permutation_aggregator | Optional[BaseTextsAggregator] | Text aggregation method used to aggregate the outputs of multiple input permutations if n_permutations is not None. If None, the most common output is used. Default value: None. |
device | str | Device to use, such as cpu or cuda. Default value: 'cpu'. |
texts_ | Series | Tasks' texts. A pandas.Series indexed by task such that result.loc[task, text] is the task's text. |
Examples:

    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoConfig
    from crowdkit.learning import TextSummarization

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    mname = "toloka/t5-large-for-text-aggregation"
    tokenizer = AutoTokenizer.from_pretrained(mname)
    model = AutoModelForSeq2SeqLM.from_pretrained(mname)
    agg = TextSummarization(tokenizer, model, device=device)
    result = agg.fit_predict(df)
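The `df` passed to `fit_predict` above is assumed to follow the crowd-kit input format for text aggregation: one row per worker response with `task`, `worker`, and `text` columns. A toy frame for illustration:

    import pandas as pd

    # Toy responses in the assumed format: one row per worker answer.
    df = pd.DataFrame(
        [
            ['t1', 'w1', 'the quick brown fox'],
            ['t1', 'w2', 'the quik brown fox jumps'],
            ['t1', 'w3', 'the quick brown fx'],
        ],
        columns=['task', 'worker', 'text'],
    )

    result = agg.fit_predict(df)  # pandas Series of aggregated texts indexed by task
    print(result.loc['t1'])
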
Method | Description |
---|---|
fit_predict | Run the aggregation and return the aggregated texts. |
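Building on the example above, the permutation-related parameters can be combined with another texts aggregator. The sketch below is an illustration under stated assumptions: it reuses `tokenizer`, `model`, `device`, and `df` from the example, picks `n_permutations=8` arbitrarily, and assumes that `ROVER` from `crowdkit.aggregation` is constructed with tokenizer and detokenizer callables.

    from crowdkit.aggregation import ROVER

    # Aggregate the outputs of 8 random input permutations with ROVER
    # instead of the default majority vote over permuted outputs.
    rover = ROVER(
        tokenizer=lambda text: text.split(' '),
        detokenizer=lambda tokens: ' '.join(tokens),
    )
    agg_permuted = TextSummarization(
        tokenizer,
        model,
        n_permutations=8,
        permutation_aggregator=rover,
        device=device,
    )
    result_permuted = agg_permuted.fit_predict(df)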