TextHRRASA

crowdkit.aggregation.texts.text_hrrasa.TextHRRASA

TextHRRASA(
    self,
    encoder: Callable[[str], Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]],
    n_iter: int = 100,
    tol: float = 1e-05,
    lambda_emb: float = 0.5,
    lambda_out: float = 0.5,
    alpha: float = 0.05,
    calculate_ranks: bool = False,
    output_similarity: Callable[[str, List[List[str]]], float] = glue_similarity
)

HRRASA on text embeddings.

Given a sentence encoder, this class encodes the texts provided by workers and runs the HRRASA algorithm to aggregate the resulting embeddings.
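To illustrate the contract expected of the encoder (one string in, one array-like embedding out), here is a minimal sketch of a stand-in encoder. The name `toy_encoder`, its `dim` parameter, and the hashed-trigram scheme are hypothetical and only for illustration; they are not part of Crowd-Kit, and a real sentence encoder will produce far more meaningful embeddings.

```python
import zlib

import numpy as np


def toy_encoder(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in encoder: hashed character trigrams, L2-normalized."""
    vec = np.zeros(dim, dtype=np.float64)
    # Pad so that even very short texts yield at least one trigram.
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 gives a hash that is stable across runs (built-in hash() is salted).
        bucket = zlib.crc32(trigram.encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


emb = toy_encoder("hello world")
print(emb.shape)
```

Any callable with this shape should be usable wherever the `encoder` argument is expected, which can be handy for quick tests that should not download a pretrained model.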

Parameters description

Parameters	Type	Description
`encoder`	Callable[[str], Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]]	A callable that takes a text and returns a NumPy array containing the corresponding embedding.
`n_iter`	int	The number of HRRASA iterations.
`lambda_emb`	float	The weight of the reliability calculated on embeddings.
`lambda_out`	float	The weight of the reliability calculated on outputs.
`alpha`	float	The confidence level for the chi-squared distribution quantiles in the beta parameter formula.
`calculate_ranks`	bool	If true, calculates the additional attribute `ranks_`.
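The signature also accepts an `output_similarity` callable, typed as `Callable[[str, List[List[str]]], float]` and defaulting to `glue_similarity`. The following is a minimal sketch of a function with that shape. It assumes that the first argument is a whitespace-tokenizable text and that the second holds tokenized reference texts; this reading is inferred from the type annotation alone. The name `jaccard_similarity` and the Jaccard measure itself are hypothetical choices, not the library's default behavior.

```python
from typing import List


def jaccard_similarity(hyp: str, refs: List[List[str]]) -> float:
    """Hypothetical similarity: mean Jaccard overlap between the hypothesis
    tokens and each tokenized reference."""
    hyp_tokens = set(hyp.split())
    if not refs:
        return 0.0
    scores = []
    for ref in refs:
        ref_tokens = set(ref)
        union = hyp_tokens | ref_tokens
        # Two empty token sets are treated as identical.
        scores.append(len(hyp_tokens & ref_tokens) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


score = jaccard_similarity(
    "the cat sat",
    [["the", "cat", "sat"], ["a", "cat", "sat", "down"]],
)
print(round(score, 3))
```

A function like this could, under the assumptions above, be supplied as `output_similarity=jaccard_similarity` when a simpler measure than the default is wanted.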

Examples:

We suggest using sentence encoders provided by Sentence Transformers.

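from crowdkit.datasets import load_dataset
from crowdkit.aggregation import TextHRRASA
from sentence_transformers import SentenceTransformer

# Load a pretrained sentence encoder; its `encode` method maps a text to an embedding.
encoder = SentenceTransformer('all-mpnet-base-v2')
hrrasa = TextHRRASA(encoder=encoder.encode)

# Load the crowdsourced annotations (df) and the ground truth (gt).
df, gt = load_dataset('crowdspeech-test-clean')
df['text'] = df['text'].str.lower()  # normalize case before aggregation

# Fit the model and aggregate the workers' texts.
result = hrrasa.fit_predict(df)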

Methods summary

Method	Description
fit_predict	Fit the model and return aggregated texts.
fit_predict_scores	Fit the model and return scores.

Last updated: March 31, 2023
