a1 CLiC, University of Barcelona, Gran Via 585, Barcelona 08007, Spain email: email@example.com
a2 USC Information Sciences Institute, 4676 Admiralty Way, Marina del Rey, CA 90292, USA email: firstname.lastname@example.org
This paper addresses the current state of coreference resolution evaluation, in which different measures (notably, MUC, B³, CEAF, and ACE-value) are applied in different studies. None of them is fully adequate, and their measures are not commensurate. We enumerate the desiderata for a coreference scoring measure, discuss the strong and weak points of the existing measures, and propose the BiLateral Assessment of Noun-Phrase Coreference, a variation of the Rand index created to suit the coreference task. The BiLateral Assessment of Noun-Phrase Coreference rewards both coreference and non-coreference links by averaging the F-scores of the two types, does not ignore singletons – the main problem with the MUC score – and does not inflate the score in their presence – a problem with the B³ and CEAF scores. In addition, its fine granularity is consistent over the whole range of scores and affords better discrimination between systems.
(Received May 03 2010)
(Revised August 17 2010)
(Accepted October 28 2010)
(Online publication December 06 2010)
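The averaging of link-level F-scores described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' reference scorer: the names `link_sets`, `f_score`, and `blanc_sketch` are hypothetical, the sketch assumes the key and the response partition exactly the same set of mentions, and it simply scores an undefined precision, recall, or F-score as 0 rather than applying any special treatment for boundary cases (for example, a document with no coreference links at all).

```python
from itertools import combinations


def link_sets(clusters):
    """Split every mention pair into coreference and non-coreference links.

    `clusters` is a list of sets of mention ids; singletons are sets of size 1.
    """
    cluster_of = {m: i for i, cluster in enumerate(clusters) for m in cluster}
    coref, non_coref = set(), set()
    # Sorting gives each unordered pair one canonical (a, b) form.
    for a, b in combinations(sorted(cluster_of), 2):
        if cluster_of[a] == cluster_of[b]:
            coref.add((a, b))
        else:
            non_coref.add((a, b))
    return coref, non_coref


def f_score(key_links, response_links):
    """F-score of one link type; undefined values are scored 0 (assumption)."""
    right = len(key_links & response_links)
    precision = right / len(response_links) if response_links else 0.0
    recall = right / len(key_links) if key_links else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def blanc_sketch(key, response):
    """Average the coreference-link and non-coreference-link F-scores."""
    key_coref, key_non = link_sets(key)
    res_coref, res_non = link_sets(response)
    return (f_score(key_coref, res_coref) + f_score(key_non, res_non)) / 2


# Hypothetical example: five mentions, one three-mention entity, two singletons.
key = [{1, 2, 3}, {4}, {5}]
response = [{1, 2}, {3}, {4}, {5}]
score = blanc_sketch(key, response)
```

In this example the response recovers one of the three coreference links (F = 0.5) and all seven non-coreference links while adding two spurious ones (F = 0.875), so the averaged score is approximately 0.6875. Because singletons contribute non-coreference links, they are neither ignored nor able to push the score up on their own: the coreference-link term still pulls the average down.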