DawidSkene

crowdkit.aggregation.classification.dawid_skene.DawidSkene

DawidSkene(
self,
n_iter: int = 100,
tol: float = 1e-05
)

Dawid-Skene aggregation model.

Probabilistic model that parametrizes workers' level of expertise through confusion matrices.

Let $e^w$ be a worker's confusion (error) matrix of size $K \times K$ in the case of $K$-class classification, $p$ be the vector of prior class probabilities, $z_j$ be the true label of task $j$, and $y^w_j$ be the worker's answer for task $j$. The relationships between these parameters are represented by the following latent label model.

[Figure: Dawid-Skene latent label model]

Here the prior true label probability is

$$\operatorname{Pr}(z_j = c) = p[c],$$

and the distribution of the worker's responses given the true label $c$ is represented by the corresponding column of the error matrix:

$$\operatorname{Pr}(y_j^w = k \mid z_j = c) = e^w[k, c].$$
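The two distributions above describe how answers are generated, which can be made concrete by sampling from them. The following is only an illustrative sketch and not part of Crowd-Kit: the function `sample_answers`, the nested-dict layout (`errors[w][c][k]` stands for $e^w[k, c]$), the worker names, and all numbers are assumptions.

```python
import random

def sample_answers(priors, errors, n_tasks, seed=0):
    """Draw true labels and worker answers from the latent label model.

    priors: dict label -> p[label]
    errors: dict worker -> {true label c -> {observed label k -> e^w[k, c]}}
    (hypothetical layout chosen for this sketch)
    """
    rng = random.Random(seed)
    labels = list(priors)
    true_labels, answers = {}, []
    for j in range(n_tasks):
        # z_j is drawn from the prior: Pr(z_j = c) = p[c]
        z = rng.choices(labels, weights=[priors[c] for c in labels])[0]
        true_labels[j] = z
        for w, e in errors.items():
            # y_j^w is drawn from the column of e^w selected by the true label
            y = rng.choices(labels, weights=[e[z][k] for k in labels])[0]
            answers.append((j, w, y))
    return true_labels, answers

# Made-up parameters: one careful worker and one who answers at random.
priors = {"cat": 0.7, "dog": 0.3}
errors = {
    "careful": {"cat": {"cat": 0.95, "dog": 0.05},
                "dog": {"cat": 0.10, "dog": 0.90}},
    "random": {"cat": {"cat": 0.5, "dog": 0.5},
               "dog": {"cat": 0.5, "dog": 0.5}},
}
true_labels, answers = sample_answers(priors, errors, n_tasks=5)
```

Each column of an error matrix is a probability distribution over observed labels, so the inner dictionaries sum to 1.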

The parameters $p$ and $e^w$ and the latent variables $z$ are optimized through the Expectation-Maximization algorithm.
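To show what the alternation between the two steps can look like, here is a minimal pure-Python sketch of EM for this model. It is not Crowd-Kit's implementation: the function name `dawid_skene_em`, the `(task, worker, label)` triple input, the vote-fraction initialisation, and the log-likelihood stopping rule are all assumptions made for illustration.

```python
import math

def dawid_skene_em(answers, n_iter=100, tol=1e-5):
    """answers: list of (task, worker, label) triples (assumed input format)."""
    tasks = sorted({t for t, _, _ in answers})
    workers = sorted({w for _, w, _ in answers})
    labels = sorted({k for _, _, k in answers})

    # Initialise the posteriors over true labels with per-task vote fractions.
    probas = {t: {c: 0.0 for c in labels} for t in tasks}
    for t, _, k in answers:
        probas[t][k] += 1.0
    for t in tasks:
        total = sum(probas[t].values())
        probas[t] = {c: v / total for c, v in probas[t].items()}

    prev_loglik = float("-inf")
    for _ in range(n_iter):
        # M-step: re-estimate p[c] and e^w[k, c] from the current posteriors.
        priors = {c: sum(probas[t][c] for t in tasks) / len(tasks) for c in labels}
        # errors[w][c][k] stores e^w[k, c].
        errors = {w: {c: {k: 0.0 for k in labels} for c in labels} for w in workers}
        for t, w, k in answers:
            for c in labels:
                errors[w][c][k] += probas[t][c]
        for w in workers:
            for c in labels:
                total = sum(errors[w][c].values())
                for k in labels:
                    # Uniform column if the worker has no mass on class c.
                    errors[w][c][k] = (errors[w][c][k] / total if total > 0
                                       else 1.0 / len(labels))

        # E-step: Pr(z_j = c | answers) is proportional to
        # p[c] times the product over workers of e^w[y_j^w, c].
        joint = {t: dict(priors) for t in tasks}
        for t, w, k in answers:
            for c in labels:
                joint[t][c] *= errors[w][c][k]
        loglik = 0.0
        for t in tasks:
            evidence = sum(joint[t].values())
            loglik += math.log(evidence)
            probas[t] = {c: v / evidence for c, v in joint[t].items()}

        # Stop once the log-likelihood change falls below tol (assumed rule).
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik

    predicted = {t: max(probas[t], key=probas[t].get) for t in tasks}
    return predicted, probas, priors, errors

# Toy data: w1 and w2 always agree, w3 disagrees with them on t1 and t2.
answers = [
    ("t1", "w1", "A"), ("t1", "w2", "A"), ("t1", "w3", "B"),
    ("t2", "w1", "B"), ("t2", "w2", "B"), ("t2", "w3", "A"),
    ("t3", "w1", "A"), ("t3", "w2", "A"), ("t3", "w3", "A"),
    ("t4", "w1", "B"), ("t4", "w2", "B"), ("t4", "w3", "B"),
]
predicted, probas, priors, errors = dawid_skene_em(answers)
```

The four returned objects play roughly the roles of the `labels_`, `probas_`, `priors_`, and `errors_` attributes, but as plain dictionaries rather than pandas objects.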

A. Philip Dawid and Allan M. Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 28, 1 (1979), 20–28.

https://doi.org/10.2307/2346806

Parameters Description

| Parameters | Type | Description |
| --- | --- | --- |
| n_iter | int | The number of EM iterations. |
| labels_ | Optional[Series] | Tasks' labels. A pandas.Series indexed by task such that labels.loc[task] is the task's most likely true label. |
| probas_ | Optional[DataFrame] | Tasks' label probability distributions. A pandas.DataFrame indexed by task such that result.loc[task, label] is the probability that the task's true label equals label. Each probability is between 0 and 1, and each task's probabilities sum to 1. |
| priors_ | Optional[Series] | A prior label distribution. A pandas.Series indexed by label and holding the corresponding label's probability of occurrence. Each probability is between 0 and 1, and all probabilities sum to 1. |
| errors_ | Optional[DataFrame] | Workers' error matrices. A pandas.DataFrame indexed by worker and label with a column for every label_id found in data such that result.loc[worker, observed_label, true_label] is the probability of the worker producing observed_label given that the task's true label is true_label. |

Examples:

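from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset
# df holds the crowd annotations, gt the ground-truth labels
df, gt = load_dataset('relevance-2')
# run at most 100 EM iterations
ds = DawidSkene(n_iter=100)
# aggregated label for every task
result = ds.fit_predict(df)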

Methods Summary

| Method | Description |
| --- | --- |
| fit | Fit the model through the EM-algorithm. |
| fit_predict | Fit the model and return aggregated results. |
| fit_predict_proba | Fit the model and return probability distributions on labels for each task. |