HRRASA

crowdkit.aggregation.embeddings.hrrasa.HRRASA

HRRASA(
    self,
    n_iter: int = 100,
    tol: float = 1e-09,
    lambda_emb: float = 0.5,
    lambda_out: float = 0.5,
    alpha: float = 0.05,
    calculate_ranks: bool = False,
    output_similarity: Callable[[str, List[List[str]]], float] = glue_similarity
)

Hybrid Reliability and Representation Aware Sequence Aggregation.

In the first step, HRRASA estimates local worker reliabilities, which represent how good a worker's answer is on one particular task. The local reliability of worker $k$ on task $i$ is denoted by $\gamma_i^k$ and is calculated as a weighted sum of two terms:

$$\gamma_i^k = \lambda_{emb}\gamma_{i,emb}^k + \lambda_{out}\gamma_{i,out}^k, \; \lambda_{emb} + \lambda_{out} = 1.$$

Here $\gamma_{i,emb}^k$ is the reliability calculated on embeddings and $\gamma_{i,out}^k$ is the reliability calculated on outputs.

$\gamma_{i,emb}^k$ is calculated by the following equation:

$$\gamma_{i,emb}^k = \frac{1}{|\mathcal{U}_i| - 1}\sum_{a_i^{k'} \in \mathcal{U}_i, k \neq k'}\exp\left(\frac{\|e_i^k-e_i^{k'}\|^2}{\|e_i^k\|^2\|e_i^{k'}\|^2}\right),$$

where $\mathcal{U}_i$ is the set of workers' responses on task $i$. $\gamma_{i,out}^k$ uses a similarity measure $sim$ on the output data, e.g. GLUE similarity on texts:

$$\gamma_{i,out}^k = \frac{1}{|\mathcal{U}_i| - 1}\sum_{a_i^{k'} \in \mathcal{U}_i, k \neq k'}sim(a_i^k, a_i^{k'}).$$
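As a rough illustration of the output-based term and of the weighted sum, here is a minimal pure-Python sketch for a single task. It is not the library's implementation: `toy_similarity` is a hypothetical stand-in for $sim$ (token Jaccard overlap instead of `glue_similarity`), the helper names are invented for this example, and the $\gamma_{i,emb}^k$ values are placeholder numbers rather than computed ones.

```python
from typing import Callable, Dict


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for sim: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def local_output_reliability(
    responses: Dict[str, str],
    sim: Callable[[str, str], float] = toy_similarity,
) -> Dict[str, float]:
    """gamma_{i,out}^k for every worker k on one task.

    `responses` maps worker id -> output; at least two responses are needed
    because the average runs over the *other* workers' responses.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses on the task")
    return {
        k: sum(sim(a, b) for k2, b in responses.items() if k2 != k) / (n - 1)
        for k, a in responses.items()
    }


def combine(
    gamma_emb: Dict[str, float],
    gamma_out: Dict[str, float],
    lambda_emb: float = 0.5,
    lambda_out: float = 0.5,
) -> Dict[str, float]:
    """gamma_i^k = lambda_emb * gamma_emb + lambda_out * gamma_out."""
    if abs(lambda_emb + lambda_out - 1.0) > 1e-9:
        raise ValueError("lambda_emb + lambda_out must equal 1")
    return {k: lambda_emb * gamma_emb[k] + lambda_out * gamma_out[k] for k in gamma_out}


responses = {"p1": "the cat sat", "p2": "the cat sat", "p3": "a dog ran"}
gamma_out = local_output_reliability(responses)

# Placeholder embedding-based reliabilities (assumed values, not computed here).
gamma_emb = {"p1": 0.8, "p2": 0.8, "p3": 0.2}
gamma = combine(gamma_emb, gamma_out)
```

In this toy input, the two workers who agree with each other get a higher output-based reliability than the worker whose answer shares no tokens with the others.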

HRRASA also estimates global worker reliabilities $\beta$, which are initialized to ones.

Next, the algorithm iteratively performs two steps:

  1. For each task, estimate the aggregated embedding: $\hat{e}_i = \frac{\sum_k \gamma_i^k \beta_k e_i^k}{\sum_k \gamma_i^k \beta_k}$

  2. For each worker, estimate the global reliability: $\beta_k = \frac{\chi^2_{(\alpha/2, |\mathcal{V}_k|)}}{\sum_i\left(\|e_i^k - \hat{e}_i\|^2/\gamma_i^k\right)}$, where $\mathcal{V}_k$ is the set of tasks completed by worker $k$
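The two steps above can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the Crowd-Kit code. The helper names are invented for this example. The chi-squared quantile is passed in as a caller-supplied function, and the demo uses the closed form of the lower-tail quantile for two degrees of freedom, which applies only because every worker in the toy data completed exactly two tasks. Reading $\chi^2_{(\alpha/2, |\mathcal{V}_k|)}$ as a lower-tail quantile is itself an assumption. The small floor on the denominator is also an addition of this sketch.

```python
import math
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (task, worker)


def sq_dist(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def aggregate_embeddings(
    emb: Dict[Key, List[float]], gamma: Dict[Key, float], beta: Dict[str, float]
) -> Dict[str, List[float]]:
    """Step 1: per-task mean of embeddings weighted by gamma_i^k * beta_k."""
    num: Dict[str, List[float]] = {}
    den: Dict[str, float] = {}
    for (task, worker), e in emb.items():
        w = gamma[(task, worker)] * beta[worker]
        acc = num.setdefault(task, [0.0] * len(e))
        for j, x in enumerate(e):
            acc[j] += w * x
        den[task] = den.get(task, 0.0) + w
    return {t: [x / den[t] for x in num[t]] for t in num}


def update_beta(
    emb: Dict[Key, List[float]],
    gamma: Dict[Key, float],
    e_hat: Dict[str, List[float]],
    quantile: Callable[[float, int], float],
    alpha: float = 0.05,
) -> Dict[str, float]:
    """Step 2: beta_k = quantile(alpha/2, |V_k|) / sum_i(||e - e_hat||^2 / gamma).

    Assumes every gamma value is strictly positive.
    """
    err: Dict[str, float] = {}
    cnt: Dict[str, int] = {}
    for (task, worker), e in emb.items():
        err[worker] = err.get(worker, 0.0) + sq_dist(e, e_hat[task]) / gamma[(task, worker)]
        cnt[worker] = cnt.get(worker, 0) + 1
    # The 1e-12 floor (an assumption of this sketch) avoids division by zero.
    return {w: quantile(alpha / 2, cnt[w]) / max(err[w], 1e-12) for w in err}


def chi2_quantile_df2(p: float, df: int) -> float:
    """Lower-tail chi-squared quantile, closed form valid for df == 2 only."""
    if df != 2:
        raise ValueError("this toy quantile only supports df == 2")
    return -2.0 * math.log(1.0 - p)


emb = {
    ("t1", "p1"): [1.0, 0.0], ("t1", "p2"): [0.9, 0.1], ("t1", "p3"): [0.0, 1.0],
    ("t2", "p1"): [0.0, 1.0], ("t2", "p2"): [0.1, 0.9], ("t2", "p3"): [1.0, 0.0],
}
gamma = {key: 0.5 for key in emb}            # placeholder local reliabilities
beta = {"p1": 1.0, "p2": 1.0, "p3": 1.0}     # initialized to ones

for _ in range(3):
    e_hat = aggregate_embeddings(emb, gamma, beta)
    beta = update_beta(emb, gamma, e_hat, chi2_quantile_df2)
```

In this toy data, worker `p3` disagrees with the other two on both tasks, so its global reliability ends up the lowest and the aggregated embeddings drift toward the answers of `p1` and `p2`.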

Finally, the aggregated result is the output whose embedding is closest to $\hat{e}_i$. If calculate_ranks is true, the method also calculates a rank for each worker's response as

$$s_i^k = \beta_k \exp\left(-\frac{\|e_i^k - \hat{e}_i\|^2}{\|e_i^k\|^2\|\hat{e}_i\|^2}\right) + \gamma_i^k.$$
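The selection of the final output and the rank formula can be illustrated for a single task with the short sketch below. The function names and all input values are hypothetical. "Closest" is taken here to mean smallest squared Euclidean distance, which is an assumption of this sketch rather than something the description above specifies.

```python
import math
from typing import Dict, List


def sq_norm(v: List[float]) -> float:
    """Squared Euclidean norm."""
    return sum(x * x for x in v)


def sq_dist(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def closest_output(
    task_emb: Dict[str, List[float]], task_out: Dict[str, str], e_hat: List[float]
) -> str:
    """Return the output whose embedding is nearest to the aggregated embedding."""
    best_worker = min(task_emb, key=lambda k: sq_dist(task_emb[k], e_hat))
    return task_out[best_worker]


def rank_scores(
    task_emb: Dict[str, List[float]],
    e_hat: List[float],
    beta: Dict[str, float],
    gamma: Dict[str, float],
) -> Dict[str, float]:
    """s_i^k = beta_k * exp(-||e - e_hat||^2 / (||e||^2 * ||e_hat||^2)) + gamma_i^k."""
    scores = {}
    for k, e in task_emb.items():
        scaled = sq_dist(e, e_hat) / (sq_norm(e) * sq_norm(e_hat))
        scores[k] = beta[k] * math.exp(-scaled) + gamma[k]
    return scores


# Toy inputs for one task (all values are assumed for illustration).
task_emb = {"p1": [1.0, 0.0], "p2": [1.0, 0.0], "p3": [0.0, 1.0]}
task_out = {"p1": "a", "p2": "a", "p3": "b"}
e_hat = [0.8, 0.2]
beta = {"p1": 1.0, "p2": 1.0, "p3": 1.0}
gamma = {"p1": 0.5, "p2": 0.5, "p3": 0.0}

winner = closest_output(task_emb, task_out, e_hat)
scores = rank_scores(task_emb, e_hat, beta, gamma)
```

With these inputs, the responses of `p1` and `p2` lie nearer to the aggregated embedding, so they receive higher rank scores than the response of `p3`.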

Jiyi Li. Crowdsourced Text Sequence Aggregation based on Hybrid Reliability and Representation. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20), July 25–30, 2020, Virtual Event, China. ACM, New York, NY, USA,

https://doi.org/10.1145/3397271.3401239

Parameters Description

| Parameters | Type | Description |
|---|---|---|
| n_iter | int | A number of iterations. |
| lambda_emb | float | A weight of reliability calculated on embeddings. |
| lambda_out | float | A weight of reliability calculated on outputs. |
| alpha | float | Confidence level of chi-squared distribution quantiles in beta parameter formula. |
| calculate_ranks | bool | If true, calculate additional attribute ranks_. |

Examples:

import numpy as np
import pandas as pd
from crowdkit.aggregation import HRRASA

# One row per response: task id, worker id, output, and the output's embedding.
df = pd.DataFrame(
    [
        ['t1', 'p1', 'a', np.array([1.0, 0.0])],
        ['t1', 'p2', 'a', np.array([1.0, 0.0])],
        ['t1', 'p3', 'b', np.array([0.0, 1.0])]
    ],
    columns=['task', 'worker', 'output', 'embedding']
)

# Fit the model and get the aggregated output for each task.
result = HRRASA().fit_predict(df)

Methods Summary

| Method | Description |
|---|---|
| fit | Fit the model. |
| fit_predict | Fit the model and return aggregated outputs. |
| fit_predict_scores | Fit the model and return scores. |