Toloka documentation

RASA

crowdkit.aggregation.embeddings.rasa.RASA

RASA(
    self,
    n_iter: int = 100,
    tol: float = 1e-09,
    alpha: float = 0.05
)

Reliability Aware Sequence Aggregation.

RASA estimates global worker reliabilities $\beta$, which are initialized to ones.

Next, the algorithm iteratively performs two steps:

  1. For each task, estimate the aggregated embedding: $\hat{e}_i = \frac{\sum_k \beta_k e_i^k}{\sum_k \beta_k}$
  2. For each worker, estimate the global reliability: $\beta_k = \frac{\chi^2_{(\alpha/2, |\mathcal{V}_k|)}}{\sum_i\left(\|e_i^k - \hat{e}_i\|^2\right)}$, where $\mathcal{V}_k$ is the set of tasks completed by worker $k$

Finally, the aggregated result for each task is the output whose embedding is closest to $\hat{e}_i$.
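
The two steps and the final selection can be sketched in plain NumPy. This is only an illustration under simplifying assumptions, not crowdkit's implementation: the function name `rasa_sketch` and the `eps` guard are hypothetical, every worker is assumed to have answered every task (so the number of completed tasks is the same for all workers), the chi-squared quantile is assumed to be the upper-tail critical value, and the `tol`-based stopping rule is omitted in favor of a fixed number of iterations.

```python
import numpy as np
from scipy.stats import chi2


def rasa_sketch(embeddings, alpha=0.05, n_iter=100, eps=1e-12):
    """Illustrative RASA-style loop (hypothetical helper, not crowdkit code).

    embeddings: array of shape (n_workers, n_tasks, dim). Assumes every
    worker answered every task, so |V_k| equals n_tasks for all workers.
    """
    n_workers, n_tasks, _ = embeddings.shape
    beta = np.ones(n_workers)  # reliabilities are initialized to ones
    # Assumption: the quantile is taken as the upper-tail critical value.
    quantile = chi2.isf(alpha / 2, n_tasks)

    for _ in range(n_iter):
        # Step 1: reliability-weighted mean embedding for each task.
        agg = np.tensordot(beta, embeddings, axes=1) / beta.sum()
        # Step 2: quantile divided by each worker's summed squared distance.
        sq_dist = ((embeddings - agg) ** 2).sum(axis=(1, 2))
        beta = quantile / np.maximum(sq_dist, eps)  # eps avoids division by zero

    # Final step: for each task, the index of the worker whose answer is closest.
    closest = ((embeddings - agg) ** 2).sum(axis=2).argmin(axis=0)
    return agg, beta, closest


embeddings = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # worker 0
    [[1.0, 0.0], [0.0, 1.0]],  # worker 1, agrees with worker 0
    [[0.0, 1.0], [1.0, 0.0]],  # worker 2, disagrees on both tasks
])
agg, beta, closest = rasa_sketch(embeddings)
```

In this toy input the two agreeing workers end up with a much larger reliability than the dissenting one, so the aggregated embeddings move toward their answers.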

Jiyi Li. A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation. Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP, pages 24–28, Hong Kong, China, November 3, 2019. https://doi.org/10.18653/v1/D19-5904

Parameters Description

Parameters Type Description
n_iter int The number of iterations.
alpha float Confidence level of the chi-squared distribution quantiles in the $\beta$ formula.
embeddings_and_outputs_ DataFrame Tasks' embeddings and outputs. A pandas.DataFrame indexed by task with embedding and output columns.

Examples:

import numpy as np
import pandas as pd
from crowdkit.aggregation import RASA

# Each row is one worker's answer to one task, with the answer's embedding.
df = pd.DataFrame(
    [
        ['t1', 'p1', 'a', np.array([1.0, 0.0])],
        ['t1', 'p2', 'a', np.array([1.0, 0.0])],
        ['t1', 'p3', 'b', np.array([0.0, 1.0])]
    ],
    columns=['task', 'worker', 'output', 'embedding']
)
# Fit the model and get the aggregated output for each task.
result = RASA().fit_predict(df)
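
To make the input format concrete, the first aggregation step can be worked by hand on the same data. The sketch below does not call crowdkit; it uses only pandas and NumPy, and the names `first_estimate`, `distance`, and `closest` are illustrative. With all reliabilities equal to one, the weighted average reduces to a plain mean of the embeddings for each task.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['t1', 'p1', 'a', np.array([1.0, 0.0])],
        ['t1', 'p2', 'a', np.array([1.0, 0.0])],
        ['t1', 'p3', 'b', np.array([0.0, 1.0])],
    ],
    columns=['task', 'worker', 'output', 'embedding'],
)

# First iteration only: all reliabilities are one, so average the embeddings per task.
first_estimate = {
    task: np.mean(np.stack(group['embedding'].to_list()), axis=0)
    for task, group in df.groupby('task')
}

# Euclidean distance from each answer's embedding to its task's estimate.
df['distance'] = [
    np.linalg.norm(emb - first_estimate[task])
    for task, emb in zip(df['task'], df['embedding'])
]

# For each task, keep the output whose embedding is closest to the estimate.
closest = df.loc[df.groupby('task')['distance'].idxmin(), ['task', 'output']]
```

Here the estimate for `t1` lies between the two distinct embeddings but nearer to the one shared by `p1` and `p2`, so output `a` is the closest answer after this first step.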

Methods Summary

Method Description
fit Fit the model.
fit_predict Fit the model and return aggregated outputs.
fit_predict_scores Fit the model and return scores.