DawidSkene

crowdkit.aggregation.classification.dawid_skene.DawidSkene

DawidSkene(
self,
n_iter: int = 100,
tol: float = 1e-05
)

The Dawid-Skene aggregation model is a probabilistic model that parametrizes the expertise level of workers with confusion matrices.

Let $e^w$ be the confusion (error) matrix of worker $w$, of size $K \times K$ for $K$-class classification, $p$ be a vector of prior class probabilities, $z_j$ be the true label of task $j$, and $y^w_j$ be the response of worker $w$ to task $j$. The relationship between these parameters is represented by the following latent label model.

[Figure: Dawid-Skene latent label model]

Here the prior true label probability is

$\operatorname{Pr}(z_j = c) = p[c]$,

and the probability distribution of the worker responses given the true label $c$ is represented by the corresponding column of the error matrix:

$\operatorname{Pr}(y_j^w = k \mid z_j = c) = e^w[k, c]$
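The two formulas above describe a generative process, which can be illustrated by sampling from it. The following is a minimal sketch using NumPy only; the values of `p` and `e_w` and the number of tasks are made-up assumptions for illustration and do not come from Crowd-Kit.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 2                      # number of classes
p = np.array([0.7, 0.3])   # assumed prior class probabilities p[c]

# Assumed error matrix e^w for a single worker. Column c is the distribution
# of the worker's response when the true label is c, so each column sums to 1.
e_w = np.array([
    [0.9, 0.2],   # Pr(y = 0 | z = 0), Pr(y = 0 | z = 1)
    [0.1, 0.8],   # Pr(y = 1 | z = 0), Pr(y = 1 | z = 1)
])

n_tasks = 5
# True labels: z_j is drawn from the prior p.
z = rng.choice(K, size=n_tasks, p=p)
# Worker responses: y_j^w is drawn from column z_j of the error matrix.
y = np.array([rng.choice(K, p=e_w[:, c]) for c in z])
```

In this sketch a perfectly reliable worker would correspond to `e_w` being the identity matrix.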

Parameters $p$, $e^w$, and latent variables $z$ are optimized with the Expectation-Maximization algorithm:

  1. E-step. Estimates the true task label probabilities using the specified workers' responses, the prior label probabilities, and the workers' error probability matrix.
  2. M-step. Estimates the workers' error probability matrix using the specified workers' responses and the true task label probabilities.
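The two steps above can be sketched in plain NumPy. This is an illustrative sketch, not Crowd-Kit's implementation: the function name `dawid_skene_em`, the integer-triple input format, the majority-vote initialisation, the prior update inside the M-step, and the use of the observed-data log-likelihood as the convergence measure are all assumptions made for this example.

```python
import numpy as np

def dawid_skene_em(responses, n_classes, n_iter=100, tol=1e-5):
    """Hypothetical helper. `responses` holds (task, worker, label) integer
    triples with ids starting at 0; every task needs at least one response."""
    tasks, workers, labels = np.asarray(responses).T
    n_tasks, n_workers = tasks.max() + 1, workers.max() + 1
    eps = 1e-12  # guards against log(0) and division by zero

    # Initialise the label probabilities with per-task vote shares.
    probas = np.zeros((n_tasks, n_classes))
    np.add.at(probas, (tasks, labels), 1.0)
    probas /= probas.sum(axis=1, keepdims=True)

    prev_loss = -np.inf
    for _ in range(n_iter):
        # M-step: re-estimate priors p[c] and error matrices e^w[k, c]
        # from the current soft labels.
        priors = probas.mean(axis=0)
        errors = np.zeros((n_workers, n_classes, n_classes))
        np.add.at(errors, (workers, labels), probas[tasks])
        errors /= np.maximum(errors.sum(axis=1, keepdims=True), eps)

        # E-step: log Pr(z_j = c, responses to j)
        #         = log p[c] + sum over workers of log e^w[y_j^w, c].
        log_joint = np.tile(np.log(priors + eps), (n_tasks, 1))
        np.add.at(log_joint, tasks, np.log(errors[workers, labels] + eps))

        # Normalise per task with a log-sum-exp to get posterior probabilities.
        m = log_joint.max(axis=1, keepdims=True)
        log_evidence = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
        probas = np.exp(log_joint - log_evidence[:, None])

        # Stop once the log-likelihood improves by less than tol.
        loss = log_evidence.sum()
        if loss - prev_loss < tol:
            break
        prev_loss = loss
    return probas, priors, errors

# Toy data: workers 0 and 1 always agree; worker 2 disagrees on task 0.
toy = [(j, w, j % 2) for j in range(6) for w in range(2)]
toy += [(0, 2, 1)] + [(j, 2, j % 2) for j in range(1, 6)]
probas, priors, errors = dawid_skene_em(toy, n_classes=2)
```

Here `errors[w, k, c]` plays the role of $e^w[k, c]$, so each column of a worker's matrix is normalised over the observed label `k`. Working in log space avoids numerical underflow when a task has many responses.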

A. Philip Dawid and Allan M. Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.

Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 28, 1 (1979), 20–28.

https://doi.org/10.2307/2346806

Parameters description

| Parameters | Type | Description |
| --- | --- | --- |
| n_iter | int | The maximum number of EM iterations. |
| tol | float | The tolerance stopping criterion for iterative methods with a variable number of steps. The algorithm is considered converged when the change in loss between iterations is less than tol. |
| labels_ | Optional[Series] | The task labels. The pandas.Series data is indexed by task so that labels.loc[task] is the most likely true label of the task. |
| probas_ | Optional[DataFrame] | The probability distributions of task labels. The pandas.DataFrame data is indexed by task so that result.loc[task, label] is the probability that the true label of the task is equal to label. Each probability is in the range from 0 to 1, and the probabilities for each task sum to 1. |
| priors_ | Optional[Series] | The prior label distribution. The pandas.Series data is indexed by label and contains the probability of the corresponding label occurrence. Each probability is in the range from 0 to 1, and all probabilities sum to 1. |
| errors_ | Optional[DataFrame] | The workers' error matrices. The pandas.DataFrame data is indexed by worker and label, with a column for every label_id found in data, so that result.loc[worker, observed_label, true_label] is the probability that worker produces observed_label given that the true label of the task is true_label. |
| loss_history_ | List[float] | A list of loss values during training. |
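The data layouts described for the fitted attributes can be illustrated with hand-built pandas objects. This is only a sketch: the task, worker, and label names and all numbers are hypothetical, and deriving labels from probabilities with `idxmax` is an assumption about how "the most likely true label" relates to the probability table, not a statement about Crowd-Kit internals.

```python
import pandas as pd

# Hypothetical probas_-style frame: rows are tasks, columns are labels.
probas = pd.DataFrame(
    {"cat": [0.9, 0.2], "dog": [0.1, 0.8]},
    index=pd.Index(["t1", "t2"], name="task"),
)
# Most likely label per task, in the shape described for labels_.
labels = probas.idxmax(axis=1)

# Hypothetical errors_-style frame: rows are (worker, observed label),
# columns are true labels.
errors = pd.DataFrame(
    [[0.9, 0.2], [0.1, 0.8]],
    index=pd.MultiIndex.from_tuples(
        [("w1", "cat"), ("w1", "dog")], names=["worker", "label"]
    ),
    columns=["cat", "dog"],
)
# Probability that worker w1 answers "dog" when the true label is "cat".
p_dog_given_cat = errors.loc[("w1", "dog"), "cat"]
```

With this layout, a single worker's full error matrix can be selected with `errors.loc["w1"]`.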

Examples:

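# Load a sample dataset of crowdsourced responses and its ground truth labels.
from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset
df, gt = load_dataset('relevance-2')
# Run at most 100 EM iterations and return the aggregated label for each task.
ds = DawidSkene(n_iter=100)
result = ds.fit_predict(df)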

Methods summary

| Method | Description |
| --- | --- |
| fit | Fits the model to the training data with the EM algorithm. |
| fit_predict | Fits the model to the training data and returns the aggregated results. |
| fit_predict_proba | Fits the model to the training data and returns probability distributions of labels for each task. |

Last updated: March 31, 2023
