HRRASA

crowdkit.aggregation.embeddings.hrrasa.HRRASA | Source code

HRRASA(
    self,
    n_iter: int = 100,
    tol: float = 1e-09,
    lambda_emb: float = 0.5,
    lambda_out: float = 0.5,
    alpha: float = 0.05,
    calculate_ranks: bool = False,
    output_similarity: Callable[[str, List[List[str]]], float] = glue_similarity
)

The Hybrid Reliability and Representation Aware Sequence Aggregation (HRRASA) algorithm consists of four steps.

Step 1. Encode the worker answers into embeddings.

Step 2. Estimate the local workers' reliabilities that represent how well a worker responds to one particular task. The local reliability of the worker $k$ on the task $i$ is denoted by $\gamma_i^k$ and is calculated by incorporating both types of representations:

$\gamma_i^k = \lambda_{emb}\gamma_{i,emb}^k + \lambda_{seq}\gamma_{i,seq}^k, \; \lambda_{emb} + \lambda_{seq} = 1$ ,

where the $\gamma_{i,emb}^k$ value is a reliability calculated on embedding, and the $\gamma_{i,seq}^k$ value is a reliability calculated on output.

The $\gamma_{i,emb}^k$ value is calculated by the following equation:

$\gamma_{i,emb}^k = \frac{1}{|\mathcal{U}_i| - 1}\sum_{a_i^{k'} \in \mathcal{U}_i, k \neq k'} \exp\left(\frac{\|e_i^k-e_i^{k'}\|^2}{\|e_i^k\|^2\|e_i^{k'}\|^2}\right)$ ,

where $\mathcal{U_i}$ is a set of workers' responses on task $i$ .

The $\gamma_{i,seq}^k$ value uses some similarity measure $sim$ on the output data, e.g. GLEU similarity on texts:

$\gamma_{i,seq}^k = \frac{1}{|\mathcal{U}_i| - 1}\sum_{a_i^{k'} \in \mathcal{U}_i, k \neq k'}sim(a_i^k, a_i^{k'})$ .

Step 3. Estimate the global workers' reliabilities $\beta$ by iteratively performing two steps:

For each task, estimate the aggregated embedding:

$\hat{e}_i = \frac{\sum_k \gamma_i^k \beta_k e_i^k}{\sum_k \gamma_i^k \beta_k}$ .
For each worker, estimate the global reliability:

$\beta_k = \frac{\chi^2_{(\alpha/2, |\mathcal{V}_k|)}}{\sum_i\left(\|e_i^k - \hat{e}_i\|^2/\gamma_i^k\right)}$ , where $\mathcal{V}_k$ is a set of tasks completed by the worker $k$ .

Step 4. Estimate the aggregated result. It is the output which embedding is the closest one to $\hat{e}_i$ . If calculate_ranks is true, the method also calculates ranks for each worker response as

$s_i^k = \beta_k \exp\left(-\frac{\|e_i^k - \hat{e}_i\|^2}{\|e_i^k\|^2\|\hat{e}_i\|^2}\right) + \gamma_i^k$ .

Jiyi Li. Crowdsourced Text Sequence Aggregation based on Hybrid Reliability and Representation.

In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), China (July 25–30, 2020), 1761-1764.

https://doi.org/10.1145/3397271.3401239

Parameters description

Parameters	Type	Description
`n_iter`	int	The maximum number of iterations.
`tol`	float	The tolerance stopping criterion for iterative methods with a variable number of steps. The algorithm converges when the loss change is less than the `tol` parameter.
`lambda_emb`	float	The weight of reliability calculated on embeddings.
`lambda_out`	float	The weight of reliability calculated on outputs.
`alpha`	float	The significance level of the chi-squared distribution quantiles in the $\beta$ parameter formula.
`calculate_ranks`	bool	Specifies if the additional `ranks_` attribute will be calculated (true) or not (false).
`_output_similarity`	-	The similarity measure $sim$ of the `output` data. By default, it is equal to the GLEU similarity.
`embeddings_and_outputs_`	-	The task embeddings and outputs. The `pandas.DataFrame` data is indexed by `task` and has the `embedding` and `output` columns.
`loss_history_`	List[float]	A list of loss values during training.

Examples:

import numpy as np
import pandas as pd
from crowdkit.aggregation import HRRASA
df = pd.DataFrame(
    [
        ['t1', 'p1', 'a', np.array([1.0, 0.0])],
        ['t1', 'p2', 'a', np.array([1.0, 0.0])],
        ['t1', 'p3', 'b', np.array([0.0, 1.0])]
    ],
    columns=['task', 'worker', 'output', 'embedding']
)
result = HRRASA().fit_predict(df)

Methods summary

Method	Description
fit	Fits the model to the training data.
fit_predict	Fits the model to the training data and returns the aggregated outputs.
fit_predict_scores	Fits the model to the training data and returns the estimated scores.

Last updated: March 31, 2023

Crowd-Kit

Overview

Reference

Aggregation

Datasets

Learning

Metrics

Postprocessing