crowdkit.aggregation.embeddings.hrrasa.HRRASA

HRRASA(
    n_iter: int = 100,
    tol: float = 1e-09,
    lambda_emb: float = 0.5,
    lambda_out: float = 0.5,
    alpha: float = 0.05,
    calculate_ranks: bool = False,
    output_similarity: Callable[[str, List[List[str]]], float] = glue_similarity
)

The Hybrid Reliability and Representation Aware Sequence Aggregation (HRRASA) algorithm consists of four steps.

Step 1. Encode the worker answers into embeddings.

Step 2. Estimate the local workers' reliabilities that represent how well a worker responds to one particular task. The local reliability of the worker $k$ on the task $i$ is denoted by $\gamma_i^k$ and is calculated by incorporating both types of representations:

$\gamma_i^k = \lambda_{emb}\gamma_{i,emb}^k + \lambda_{seq}\gamma_{i,seq}^k, \; \lambda_{emb} + \lambda_{seq} = 1$,

where the $\gamma_{i,emb}^k$ value is a reliability calculated on embeddings, and the $\gamma_{i,seq}^k$ value is a reliability calculated on outputs.

The $\gamma_{i,emb}^k$ value is calculated by the following equation:

$\gamma_{i,emb}^k = \frac{1}{|\mathcal{U}_i| - 1}\sum_{a_i^{k'} \in \mathcal{U}_i, k \neq k'} \exp\left(\frac{\|e_i^k-e_i^{k'}\|^2}{\|e_i^k\|^2\|e_i^{k'}\|^2}\right)$,

where $\mathcal{U}_i$ is a set of workers' responses on task $i$.

The $\gamma_{i,seq}^k$ value uses some similarity measure $sim$ on the output data, e.g. GLEU similarity on texts:

$\gamma_{i,seq}^k = \frac{1}{|\mathcal{U}_i| - 1}\sum_{a_i^{k'} \in \mathcal{U}_i, k \neq k'}sim(a_i^k, a_i^{k'})$.
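The local reliability computation of Step 2 can be sketched in plain Python for a single task. This is an illustrative sketch, not the library's implementation: the helper names (`sq_norm`, `sq_dist`, `token_overlap`, `local_reliabilities`) are hypothetical, `token_overlap` is a simple stand-in for the similarity measure rather than GLEU, and the exponent in the embedding term is assumed to be negative so that closer embeddings yield a higher reliability, in line with the rank formula of Step 4.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def sq_norm(v: Vector) -> float:
    """Squared Euclidean norm of a vector."""
    return sum(x * x for x in v)


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def token_overlap(a: str, b: str) -> float:
    """Hypothetical stand-in for sim: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def local_reliabilities(
    answers: List[Tuple[str, str, Vector]],  # (worker, output, embedding), one task
    lambda_emb: float = 0.5,
    lambda_seq: float = 0.5,
    sim: Callable[[str, str], float] = token_overlap,
) -> Dict[str, float]:
    """Return gamma_i^k for every worker k who answered the task."""
    n = len(answers)
    if n < 2:
        raise ValueError("at least two answers per task are required")
    gamma = {}
    for k, (worker, out_k, emb_k) in enumerate(answers):
        emb_sum = 0.0
        seq_sum = 0.0
        for q, (_, out_q, emb_q) in enumerate(answers):
            if q == k:
                continue
            # ASSUMPTION: negative exponent; embeddings must have non-zero norm.
            scale = sq_norm(emb_k) * sq_norm(emb_q)
            emb_sum += math.exp(-sq_dist(emb_k, emb_q) / scale)
            seq_sum += sim(out_k, out_q)
        # Average over the other |U_i| - 1 answers, then mix the two parts.
        gamma[worker] = (lambda_emb * emb_sum + lambda_seq * seq_sum) / (n - 1)
    return gamma


answers = [
    ("p1", "a", [1.0, 0.0]),
    ("p2", "a", [1.0, 0.0]),
    ("p3", "b", [0.0, 1.0]),
]
gamma = local_reliabilities(answers)
```

In this toy task the two workers who agree with each other receive the same local reliability, and the dissenting worker receives a lower one.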

Step 3. Estimate the global workers' reliabilities $\beta$ by iteratively performing two steps:

  1. For each task, estimate the aggregated embedding:

    $\hat{e}_i = \frac{\sum_k \gamma_i^k \beta_k e_i^k}{\sum_k \gamma_i^k \beta_k}$.

  2. For each worker, estimate the global reliability:

    $\beta_k = \frac{\chi^2_{(\alpha/2, |\mathcal{V}_k|)}}{\sum_i\left(\|e_i^k - \hat{e}_i\|^2/\gamma_i^k\right)}$, where $\mathcal{V}_k$ is a set of tasks completed by the worker $k$.
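The alternating updates of Step 3 can be sketched as follows. Several parts are assumptions made for illustration only: the function name `estimate_global_reliabilities`, the uniform initialisation of $\beta$, the small floor on the denominator, and the stopping rule based on the change in $\beta$ (a stand-in for the loss change that `tol` refers to). The chi-squared quantile is passed in as a callable; a function such as `scipy.stats.chi2.ppf` could supply it, but which tail the library uses is not stated here.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Key = Tuple[str, str]  # (task, worker)


def estimate_global_reliabilities(
    embeddings: Dict[Key, Vector],
    gamma: Dict[Key, float],
    chi2_quantile: Callable[[float, int], float],
    alpha: float = 0.05,
    n_iter: int = 100,
    tol: float = 1e-9,
) -> Tuple[Dict[str, float], Dict[str, Vector]]:
    """Return (beta per worker, aggregated embedding per task)."""
    tasks = sorted({t for t, _ in embeddings})
    workers = sorted({w for _, w in embeddings})
    beta = {w: 1.0 for w in workers}  # ASSUMPTION: uniform start
    aggregated: Dict[str, Vector] = {}
    for _ in range(n_iter):
        # 1. Aggregated embedding: mean weighted by gamma_i^k * beta_k.
        aggregated = {}
        for t in tasks:
            keys = [key for key in embeddings if key[0] == t]
            weights = [gamma[key] * beta[key[1]] for key in keys]
            total = sum(weights)
            dim = len(embeddings[keys[0]])
            aggregated[t] = [
                sum(w * embeddings[key][d] for w, key in zip(weights, keys)) / total
                for d in range(dim)
            ]
        # 2. Global reliability: quantile over gamma-scaled squared distances.
        new_beta = {}
        for w in workers:
            keys = [key for key in embeddings if key[1] == w]
            denom = sum(
                sum((x - y) ** 2 for x, y in zip(embeddings[key], aggregated[key[0]]))
                / gamma[key]
                for key in keys
            )
            # ASSUMPTION: floor the denominator to avoid division by zero.
            new_beta[w] = chi2_quantile(alpha / 2, len(keys)) / max(denom, 1e-12)
        change = max(abs(new_beta[w] - beta[w]) for w in workers)
        beta = new_beta
        if change < tol:  # ASSUMPTION: stand-in for the loss-based criterion
            break
    return beta, aggregated
```

The function expects `embeddings` and `gamma` keyed by `(task, worker)` pairs, so local reliabilities from Step 2 need to be collected across all tasks before calling it.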

Step 4. Estimate the aggregated result. For each task, it is the output whose embedding is the closest to $\hat{e}_i$. If calculate_ranks is true, the method also calculates ranks for each worker response as

$s_i^k = \beta_k \exp\left(-\frac{\|e_i^k - \hat{e}_i\|^2}{\|e_i^k\|^2\|\hat{e}_i\|^2}\right) + \gamma_i^k$.
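Step 4 can be illustrated for one task with the sketch below. The function name `aggregate_task` and the input values are hypothetical; the sketch assumes that the aggregated embedding, the global reliabilities and the local reliabilities have already been estimated in the previous steps.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sq_norm(v: Vector) -> float:
    """Squared Euclidean norm of a vector."""
    return sum(x * x for x in v)


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def aggregate_task(
    answers: List[Tuple[str, str, Vector]],  # (worker, output, embedding), one task
    aggregated: Vector,                      # estimated embedding for this task
    beta: Dict[str, float],                  # global reliability per worker
    gamma: Dict[str, float],                 # local reliability per worker on this task
) -> Tuple[str, Dict[str, float]]:
    """Return the aggregated output and the rank score of each worker's response."""
    # The aggregated result is the output whose embedding is closest to the estimate.
    best_output = min(answers, key=lambda a: sq_dist(a[2], aggregated))[1]
    # Rank score: beta_k * exp(-normalised squared distance) + gamma_i^k.
    ranks = {
        worker: beta[worker]
        * math.exp(-sq_dist(emb, aggregated) / (sq_norm(emb) * sq_norm(aggregated)))
        + gamma[worker]
        for worker, _, emb in answers
    }
    return best_output, ranks


answers = [
    ("p1", "a", [1.0, 0.0]),
    ("p2", "a", [1.0, 0.0]),
    ("p3", "b", [0.0, 1.0]),
]
best, ranks = aggregate_task(
    answers,
    aggregated=[0.9, 0.1],                      # illustrative values only
    beta={"p1": 2.0, "p2": 2.0, "p3": 0.5},
    gamma={"p1": 0.5, "p2": 0.5, "p3": 0.1},
)
```

When several outputs are equally close to the aggregated embedding, `min` returns the first one in the list; how the library breaks such ties is not covered here.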

Jiyi Li. Crowdsourced Text Sequence Aggregation based on Hybrid Reliability and Representation.

In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), China (July 25–30, 2020), 1761-1764.

Parameters description


n_iter: The maximum number of iterations.

tol: The tolerance stopping criterion for iterative methods with a variable number of steps. The algorithm converges when the loss change is less than the tol parameter.

lambda_emb: The weight of reliability calculated on embeddings.

lambda_out: The weight of reliability calculated on outputs.

alpha: The significance level of the chi-squared distribution quantiles in the $\beta$ parameter formula.

calculate_ranks: Specifies whether the additional ranks_ attribute is calculated (True) or not (False).

output_similarity: The similarity measure $sim$ of the output data. By default, it is equal to the GLEU similarity.


embeddings_and_outputs_: The task embeddings and outputs. The pandas.DataFrame data is indexed by task and has the embedding and output columns.

loss_history_: A list of loss values during training.


import numpy as np
import pandas as pd
from crowdkit.aggregation import HRRASA

# Each row is one worker response: task, worker, output and its embedding.
df = pd.DataFrame(
    [
        ['t1', 'p1', 'a', np.array([1.0, 0.0])],
        ['t1', 'p2', 'a', np.array([1.0, 0.0])],
        ['t1', 'p3', 'b', np.array([0.0, 1.0])]
    ],
    columns=['task', 'worker', 'output', 'embedding']
)
result = HRRASA().fit_predict(df)

Methods summary

fit: Fits the model to the training data.
fit_predict: Fits the model to the training data and returns the aggregated outputs.
fit_predict_scores: Fits the model to the training data and returns the estimated scores.

Last updated: March 31, 2023