UCSC-SOE-14-03: Crowdsourcing Quantitative Evaluations: Algorithms and Empirical Results

Luca de Alfaro, Michael Shavlovsky
04/14/2014 04:26 PM
Computer Science
We consider the problem of crowdsourcing the quantitative evaluation of the quality of a set of items. Each item is examined and assigned a numerical grade by a small number of human evaluators. Evaluators are naturally affected by personal biases, random mistakes, and unequal levels of dedication to the task; our goal is to compute consensus grades that are as precise and as free of bias as possible. We present two novel algorithms for this problem: one, nicknamed VariancePropagation, is related to belief propagation and maximum likelihood; the other, nicknamed CostOfDisagreement, is based on iterated convex optimization. Both algorithms implement reputation systems that give greater weight to input from more accurate evaluators.

On synthetic data, both algorithms far outperform simple aggregators such as the average. To evaluate the performance of the algorithms in the real world, we used a dataset from a peer-grading tool, consisting of 13,136 evaluations of 2,018 submissions collected over 23 homework assignments. As this dataset lacks a ground truth, we introduce and justify the use of stability under subsampling as a measure of algorithm precision that can be applied in the absence of a ground truth. On this real-world data, we identify a version of VariancePropagation that outperforms all alternatives. We discuss the aspects and adaptations of the algorithms that make them well-suited to real-world use.
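The stability-under-subsampling measure can be sketched as follows: rerun an aggregator on random subsets of the evaluations and measure how far the resulting grades move from those computed on the full data. This is an illustrative sketch of the general idea, not the exact protocol of the report; the function names, the 80% subsample fraction, and the mean-absolute-shift statistic are assumptions.

```python
import random
import statistics

def subsampling_stability(triples, aggregator, frac=0.8, trials=20, seed=0):
    """Average absolute grade shift when the aggregator is rerun on
    random subsamples of the evaluations. Lower = more stable.

    triples: list of (evaluator, item, grade) tuples.
    aggregator: callable taking such a list, returning item -> grade.
    """
    rng = random.Random(seed)
    full = aggregator(triples)
    shifts = []
    for _ in range(trials):
        sample = [t for t in triples if rng.random() < frac]
        partial = aggregator(sample)
        common = set(full) & set(partial)  # items graded in both runs
        if common:
            shifts.append(
                statistics.mean(abs(full[i] - partial[i]) for i in common)
            )
    return statistics.mean(shifts)

def mean_aggregator(triples):
    """Baseline aggregator: plain per-item average of the grades."""
    sums, counts = {}, {}
    for _, item, grade in triples:
        sums[item] = sums.get(item, 0.0) + grade
        counts[item] = counts.get(item, 0) + 1
    return {i: sums[i] / counts[i] for i in sums}
```

Comparing this statistic across aggregators on the same dataset gives a precision proxy that needs no ground-truth grades: a more precise aggregator should be less sensitive to which particular evaluations happen to be present.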
average. To evaluate the performance of the algorithms in the real world, we used a dataset from a peer-grading tool, consisting in 13136 evaluations of 2018 submissions, collected over 23 homework assignments. As this dataset lacks a ground truth, we introduce and justify the use of stability under subsampling as a measure of algorithm precision that can be used in absence of a ground truth. On this real-world data, we identify a version of VariancePropagation that has superior performance to all other alternatives. We discuss the aspects and adaptations of the algorithms that make them well-suited to real-world use.