HRRASA

crowdkit.aggregation.embeddings.hrrasa.HRRASA

HRRASA(
    self,
    n_iter: int = 100,
    tol: float = 1e-09,
    lambda_emb: float = 0.5,
    lambda_out: float = 0.5,
    alpha: float = 0.05,
    calculate_ranks: bool = False,
    output_similarity: Callable[[str, List[List[str]]], float] = glue_similarity
)

The Hybrid Reliability and Representation Aware Sequence Aggregation (HRRASA) algorithm consists of four steps.

Step 1. Encode the worker answers into embeddings.

Step 2. Estimate the local workers' reliabilities that represent how well a worker responds to one particular task. The local reliability of the worker $k$ on the task $i$ is denoted by $\gamma_i^k$ and is calculated by incorporating both types of representations:

$\gamma_i^k = \lambda_{emb}\gamma_{i,emb}^k + \lambda_{seq}\gamma_{i,seq}^k, \; \lambda_{emb} + \lambda_{seq} = 1$,

where the $\gamma_{i,emb}^k$ value is a reliability calculated on embedding, and the $\gamma_{i,seq}^k$ value is a reliability calculated on output.
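As a small illustration, the weighted combination can be sketched in plain Python. The function name `combine_local_reliability` is hypothetical and not part of Crowd-Kit. The sketch assumes that $\lambda_{seq}$ corresponds to the `lambda_out` constructor parameter, and the check that the weights sum to one is a choice of this sketch, not a documented behavior of the library.

```python
import math


def combine_local_reliability(gamma_emb: float, gamma_seq: float,
                              lambda_emb: float = 0.5,
                              lambda_out: float = 0.5) -> float:
    """Weighted sum of the embedding-based and output-based local reliabilities.

    Hypothetical helper: lambda_out plays the role of lambda_seq in the formula.
    """
    # The formula requires the two weights to sum to one (checked here by assumption).
    if not math.isclose(lambda_emb + lambda_out, 1.0):
        raise ValueError("lambda_emb and lambda_out are expected to sum to 1")
    return lambda_emb * gamma_emb + lambda_out * gamma_seq


# A response that agrees well on embeddings (0.8) but less on raw output (0.4).
print(combine_local_reliability(0.8, 0.4, lambda_emb=0.25, lambda_out=0.75))
```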

The $\gamma_{i,emb}^k$ value is calculated by the following equation:

$\gamma_{i,emb}^k = \frac{1}{|\mathcal{U}_i| - 1}\sum_{a_i^{k'} \in \mathcal{U}_i, k \neq k'} \exp\left(\frac{\|e_i^k-e_i^{k'}\|^2}{\|e_i^k\|^2\|e_i^{k'}\|^2}\right)$,

where $\mathcal{U}_i$ is a set of workers' responses on task $i$.
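A minimal NumPy sketch of an embedding-based local reliability for a single task follows. The function name `local_reliability_emb` is hypothetical and not part of Crowd-Kit. The sketch assumes a negative exponent (as in the Step 4 rank formula), so that closer embeddings give a higher reliability; treat that sign as an assumption of the sketch rather than a statement about the library. It also assumes at least two responses and non-zero embeddings.

```python
import numpy as np


def local_reliability_emb(embeddings: np.ndarray) -> np.ndarray:
    """Average exponential-kernel similarity of each response to all others.

    embeddings: array of shape (n_responses, dim), one row per worker response
    to the same task. Returns an array of shape (n_responses,).
    """
    n = embeddings.shape[0]
    sq_norms = np.sum(embeddings ** 2, axis=1)             # ||e^k||^2
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)                  # ||e^k - e^k'||^2
    # Assumed sign: negative exponent, so identical embeddings contribute 1.
    terms = np.exp(-sq_dists / (sq_norms[:, None] * sq_norms[None, :]))
    np.fill_diagonal(terms, 0.0)                           # skip the k == k' pairs
    return terms.sum(axis=1) / (n - 1)


# Two workers agree, the third one gives an orthogonal embedding.
task_embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(local_reliability_emb(task_embeddings))
```

Under this assumption, the two agreeing responses receive a higher value than the outlier.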

The $\gamma_{i,seq}^k$ value uses some similarity measure $sim$ on the output data, e.g. GLEU similarity on texts:

$\gamma_{i,seq}^k = \frac{1}{|\mathcal{U}_i| - 1}\sum_{a_i^{k'} \in \mathcal{U}_i, k \neq k'}sim(a_i^k, a_i^{k'})$.
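The averaging in this formula can be sketched with any pairwise similarity. In the sketch below, `token_overlap` is a toy stand-in for $sim$ (Jaccard overlap of whitespace tokens, not GLEU), and both function names are hypothetical. The sketch also uses a plain two-string similarity signature, which differs from the `output_similarity` callable type shown in the class signature.

```python
from typing import Callable, List


def token_overlap(a: str, b: str) -> float:
    """Toy similarity (assumption, not GLEU): Jaccard overlap of whitespace tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def local_reliability_seq(outputs: List[str],
                          sim: Callable[[str, str], float]) -> List[float]:
    """Mean similarity of each response to every other response on the same task."""
    n = len(outputs)  # assumes at least two responses per task
    return [
        sum(sim(outputs[k], outputs[j]) for j in range(n) if j != k) / (n - 1)
        for k in range(n)
    ]


responses = ["the cat sat", "the cat sat", "a dog ran"]
print(local_reliability_seq(responses, token_overlap))  # [0.5, 0.5, 0.0]
```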

Step 3. Estimate the global workers' reliabilities $\beta$ by iteratively performing two steps:

  1. For each task, estimate the aggregated embedding:

    $\hat{e}_i = \frac{\sum_k \gamma_i^k \beta_k e_i^k}{\sum_k \gamma_i^k \beta_k}$.

  2. For each worker, estimate the global reliability:

    $\beta_k = \frac{\chi^2_{(\alpha/2, |\mathcal{V}_k|)}}{\sum_i\left(\|e_i^k - \hat{e}_i\|^2/\gamma_i^k\right)}$, where $\mathcal{V}_k$ is a set of tasks completed by the worker $k$.
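The two alternating updates can be sketched in NumPy under several simplifying assumptions: every worker answers every task (dense arrays), the chi-squared quantiles are passed in precomputed (one way to obtain them might be `scipy.stats.chi2.ppf(alpha / 2, n_tasks_done)`, although which tail the library uses is not stated here), and iteration stops when the aggregated embeddings stop changing, which is a proxy chosen for this sketch rather than the loss-based criterion described for `tol`. The name `estimate_global_reliability` and the `eps` guard are hypothetical.

```python
import numpy as np


def estimate_global_reliability(emb, gamma, quantiles, n_iter=100, tol=1e-9, eps=1e-12):
    """Alternate between aggregated embeddings and global reliabilities.

    emb:       (n_tasks, n_workers, dim) worker embeddings, dense by assumption
    gamma:     (n_tasks, n_workers) positive local reliabilities
    quantiles: (n_workers,) precomputed chi-squared quantiles (numerators)
    """
    n_workers = emb.shape[1]
    beta = np.ones(n_workers)
    e_hat = None
    for _ in range(n_iter):
        # Sub-step 1: reliability-weighted average embedding per task.
        w = gamma * beta                                    # gamma_i^k * beta_k
        new_e_hat = (w[:, :, None] * emb).sum(axis=1) / w.sum(axis=1, keepdims=True)
        # Stopping proxy (assumption): aggregated embeddings barely moved.
        if e_hat is not None and np.max(np.abs(new_e_hat - e_hat)) < tol:
            e_hat = new_e_hat
            break
        e_hat = new_e_hat
        # Sub-step 2: quantile divided by the gamma-scaled squared distances.
        sq_dist = ((emb - e_hat[:, None, :]) ** 2).sum(axis=2)
        denom = (sq_dist / gamma).sum(axis=0)
        beta = quantiles / np.maximum(denom, eps)           # eps avoids division by zero
    return beta, e_hat


# Two tasks, three workers; the third worker disagrees with the other two.
emb = np.array([
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]],
])
beta, e_hat = estimate_global_reliability(emb, np.ones((2, 3)), np.ones(3), n_iter=3)
print(beta)
```

On this toy input the disagreeing worker ends up with the smallest $\beta_k$. The `eps` guard is only there to keep the sketch finite when a worker's embedding coincides with the aggregated one.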

Step 4. Estimate the aggregated result. It is the output whose embedding is closest to $\hat{e}_i$. If calculate_ranks is true, the method also calculates ranks for each worker response as

$s_i^k = \beta_k \exp\left(-\frac{\|e_i^k - \hat{e}_i\|^2}{\|e_i^k\|^2\|\hat{e}_i\|^2}\right) + \gamma_i^k$.
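For a single task, the selection and the rank formula might be sketched as follows. The helper names `closest_output` and `rank_scores` are hypothetical, and "closest" is taken to mean the smallest Euclidean distance, which is an assumption of this sketch.

```python
import numpy as np


def closest_output(outputs, emb, e_hat):
    """Return the output whose embedding has the smallest Euclidean distance to e_hat."""
    sq_dist = np.sum((emb - e_hat) ** 2, axis=1)
    return outputs[int(np.argmin(sq_dist))]


def rank_scores(emb, e_hat, beta, gamma):
    """Scores s_i^k for one task, following the rank formula term by term.

    emb: (n_workers, dim), e_hat: (dim,), beta and gamma: (n_workers,)
    """
    sq_dist = np.sum((emb - e_hat) ** 2, axis=1)            # ||e^k - e_hat||^2
    denom = np.sum(emb ** 2, axis=1) * np.sum(e_hat ** 2)   # ||e^k||^2 * ||e_hat||^2
    return beta * np.exp(-sq_dist / denom) + gamma


emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
e_hat = np.array([0.8, 0.2])
print(closest_output(["a", "a", "b"], emb, e_hat))  # a
print(rank_scores(emb, e_hat, beta=np.ones(3), gamma=np.zeros(3)))
```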

Jiyi Li. Crowdsourced Text Sequence Aggregation based on Hybrid Reliability and Representation.

In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), China (July 25–30, 2020), 1761-1764.

https://doi.org/10.1145/3397271.3401239

Parameters description

| Parameters | Type | Description |
| --- | --- | --- |
| n_iter | int | The maximum number of iterations. |
| tol | float | The tolerance stopping criterion for iterative methods with a variable number of steps. The algorithm converges when the loss change is less than the tol parameter. |
| lambda_emb | float | The weight of reliability calculated on embeddings. |
| lambda_out | float | The weight of reliability calculated on outputs. |
| alpha | float | The significance level of the chi-squared distribution quantiles in the $\beta$ parameter formula. |
| calculate_ranks | bool | Specifies if the additional ranks_ attribute will be calculated (true) or not (false). |
| _output_similarity | - | The similarity measure $sim$ of the output data. By default, it is equal to the GLEU similarity. |
| embeddings_and_outputs_ | - | The task embeddings and outputs. The pandas.DataFrame data is indexed by task and has the embedding and output columns. |
| loss_history_ | List[float] | A list of loss values during training. |

Examples:

import numpy as np
import pandas as pd
from crowdkit.aggregation import HRRASA

df = pd.DataFrame(
    [
        ['t1', 'p1', 'a', np.array([1.0, 0.0])],
        ['t1', 'p2', 'a', np.array([1.0, 0.0])],
        ['t1', 'p3', 'b', np.array([0.0, 1.0])]
    ],
    columns=['task', 'worker', 'output', 'embedding']
)
result = HRRASA().fit_predict(df)

Methods summary

| Method | Description |
| --- | --- |
| fit | Fits the model to the training data. |
| fit_predict | Fits the model to the training data and returns the aggregated outputs. |
| fit_predict_scores | Fits the model to the training data and returns the estimated scores. |

Last updated: March 31, 2023
