Toloka documentation


crowdkit.aggregation.embeddings.rasa.RASA

RASA(
    n_iter: int = 100,
    tol: float = 1e-09,
    alpha: float = 0.05
)

Reliability Aware Sequence Aggregation.

RASA estimates global worker reliabilities $\beta$, which are initialized to ones.

Next, the algorithm iteratively performs two steps:

  1. For each task, estimate the aggregated embedding: $\hat{e}_i = \frac{\sum_k \beta_k e_i^k}{\sum_k \beta_k}$
  2. For each worker, estimate the global reliability: $\beta_k = \frac{\chi^2_{(\alpha/2, |\mathcal{V}_k|)}}{\sum_i\left(\|e_i^k - \hat{e}_i\|^2\right)}$, where $\mathcal{V}_k$ is the set of tasks completed by worker $k$

Finally, the aggregated result for each task is the output whose embedding is closest to $\hat{e}_i$.
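The two steps and the final selection can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the crowdkit implementation: `rasa_sketch` and its `eps` argument are hypothetical names, $\chi^2_{(\alpha/2, |\mathcal{V}_k|)}$ is read as the $\alpha/2$ quantile of a chi-squared distribution with $|\mathcal{V}_k|$ degrees of freedom (computed with `scipy.stats.chi2.ppf`), and the stopping rule (largest change in $\beta$ below `tol`) is a guess, since this page does not describe how `tol` is applied.

```python
import numpy as np
from scipy.stats import chi2


def rasa_sketch(tasks, workers, outputs, embeddings,
                n_iter=100, tol=1e-9, alpha=0.05, eps=1e-9):
    """Illustrative RASA-style loop (hypothetical helper, not crowdkit code)."""
    emb = np.asarray(embeddings, dtype=float)  # shape: (n_answers, dim)
    task_ids, t_idx = np.unique(np.asarray(tasks), return_inverse=True)
    worker_ids, w_idx = np.unique(np.asarray(workers), return_inverse=True)
    n_tasks, n_workers = len(task_ids), len(worker_ids)

    def aggregate(beta):
        # Step 1: reliability-weighted mean embedding for every task.
        w = beta[w_idx]
        num = np.zeros((n_tasks, emb.shape[1]))
        np.add.at(num, t_idx, w[:, None] * emb)
        den = np.bincount(t_idx, weights=w, minlength=n_tasks)
        return num / den[:, None]

    # Numerator of step 2: alpha/2 quantile with |V_k| degrees of freedom
    # (assumed reading of the notation in the formula).
    quantile = chi2.ppf(alpha / 2, np.bincount(w_idx, minlength=n_workers))

    beta = np.ones(n_workers)  # reliabilities start at one
    for _ in range(n_iter):
        agg = aggregate(beta)
        # Step 2: quantile divided by the worker's total squared distance.
        sq_dist = np.sum((emb - agg[t_idx]) ** 2, axis=1)
        total = np.bincount(w_idx, weights=sq_dist, minlength=n_workers)
        new_beta = quantile / np.maximum(total, eps)  # eps guard is an assumption
        done = np.max(np.abs(new_beta - beta)) < tol  # assumed stopping rule
        beta = new_beta
        if done:
            break

    # Final selection: per task, the output whose embedding is closest to agg.
    agg = aggregate(beta)
    sq_dist = np.sum((emb - agg[t_idx]) ** 2, axis=1)
    result = {}
    for j, task in enumerate(task_ids.tolist()):
        rows = np.flatnonzero(t_idx == j)
        result[task] = outputs[rows[np.argmin(sq_dist[rows])]]
    return result, dict(zip(worker_ids.tolist(), beta.tolist()))


result, beta = rasa_sketch(
    tasks=['t1', 't1', 't1'],
    workers=['p1', 'p2', 'p3'],
    outputs=['a', 'a', 'b'],
    embeddings=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
)
print(result, beta)
```

In this toy call, two of the three workers agree, so the aggregated embedding drifts toward their answer and the dissenting worker ends up with the smallest reliability. The `eps` guard only prevents division by zero when a worker's answers coincide exactly with the aggregate.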

Jiyi Li. A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation. Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP, pages 24–28, Hong Kong, China, November 3, 2019.

Parameters Description

Parameters               Type       Description
n_iter                   int        The number of iterations.
alpha                    float      Significance level of the chi-squared distribution quantiles in the $\beta$ formula.
embeddings_and_outputs_  DataFrame  Tasks' embeddings and outputs. A pandas.DataFrame indexed by task with embedding and output columns.


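import numpy as np
import pandas as pd
from crowdkit.aggregation import RASA

# One row per worker answer: task ID, worker ID, text output, and its embedding.
df = pd.DataFrame(
    [
        ['t1', 'p1', 'a', np.array([1.0, 0.0])],
        ['t1', 'p2', 'a', np.array([1.0, 0.0])],
        ['t1', 'p3', 'b', np.array([0.0, 1.0])]
    ],
    columns=['task', 'worker', 'output', 'embedding']
)
# Fit the model and return the aggregated output for each task.
result = RASA().fit_predict(df)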
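As a hand check of the first iteration on this example data, the formulas above can be worked through with all reliabilities equal to one. This is plain arithmetic on the three embeddings, not output captured from the library:

```latex
\hat{e}_{t1} = \frac{1\cdot(1,0) + 1\cdot(1,0) + 1\cdot(0,1)}{1+1+1}
             = \left(\tfrac{2}{3}, \tfrac{1}{3}\right)

\|e_{t1}^{p1} - \hat{e}_{t1}\|^2 = \|e_{t1}^{p2} - \hat{e}_{t1}\|^2
  = \left(\tfrac{1}{3}\right)^2 + \left(\tfrac{1}{3}\right)^2 = \tfrac{2}{9},
\qquad
\|e_{t1}^{p3} - \hat{e}_{t1}\|^2
  = \left(\tfrac{2}{3}\right)^2 + \left(\tfrac{2}{3}\right)^2 = \tfrac{8}{9}

\frac{\beta_{p1}}{\beta_{p3}} = \frac{8/9}{2/9} = 4
```

Each worker completed exactly one task, so the chi-squared numerator is the same for all three and cancels in the ratio. After one update, p1 and p2 are weighted four times as heavily as p3, which pulls $\hat{e}_{t1}$ further toward $(1, 0)$; under these formulas the closest output for t1 is therefore 'a'.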

Methods Summary

Method              Description
fit                 Fit the model.
fit_predict         Fit the model and return aggregated outputs.
fit_predict_scores  Fit the model and return scores.