Natural Language Engineering


A cross-corpus study of subjectivity identification using unsupervised learning


Department of Computer Science, The University of Texas at Dallas, 800 West Campbell Road, Richardson, Texas


In this study, we investigate unsupervised generative learning methods for subjectivity detection across different domains. We create an initial training set using simple lexicon information, and then evaluate two iterative learning methods that use a naive Bayes base classifier to learn from unannotated data. The first method is self-training, which adds high-confidence instances to the training set in each iteration. The second is a calibrated EM (expectation-maximization) method, in which we calibrate the posterior probabilities from EM so that the class distribution is similar to that in the real data. We evaluate both approaches on three domains: movie data, news resources, and meeting dialogues. We find that in some cases the unsupervised learning methods achieve performance close to that of the fully supervised setup. We perform a thorough analysis of several factors, such as the self-labeling accuracy of the initial training set in unsupervised learning, the accuracy of the added examples in self-training, and the size of the initial training set in different methods. Our experiments and analysis reveal inherent differences across domains and identify the factors that explain the models' behavior.
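The self-training procedure summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`train_nb`, `posterior_subjective`, `self_train`), the add-one smoothing, the 0.9 confidence threshold, and the toy sentences are all assumptions made for illustration, and the seed labels are hand-picked here rather than derived from a lexicon.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Train a multinomial naive Bayes model with add-alpha smoothing.
    Labels: 1 = subjective, 0 = objective."""
    vocab = {t for d in docs for t in d}
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    model = {"prior": {}, "loglik": {}}
    for y in (0, 1):
        total = sum(counts[y].values()) + alpha * len(vocab)
        model["prior"][y] = math.log((n_docs[y] + alpha) / (len(labels) + 2 * alpha))
        model["loglik"][y] = {t: math.log((counts[y][t] + alpha) / total) for t in vocab}
    return model

def posterior_subjective(model, tokens):
    """Return P(subjective | tokens); out-of-vocabulary tokens are skipped."""
    scores = {}
    for y in (0, 1):
        ll = model["loglik"][y]
        scores[y] = model["prior"][y] + sum(ll[t] for t in tokens if t in ll)
    m = max(scores.values())  # subtract the max for numerical stability
    e0, e1 = math.exp(scores[0] - m), math.exp(scores[1] - m)
    return e1 / (e0 + e1)

def self_train(seed_docs, seed_labels, unlabeled, threshold=0.9, max_iter=10):
    """Iteratively move confidently classified instances into the training set.
    `threshold` is an assumed value, not one taken from the paper."""
    docs, labels = list(seed_docs), list(seed_labels)
    pool = list(unlabeled)
    for _ in range(max_iter):
        model = train_nb(docs, labels)
        remaining, added = [], 0
        for d in pool:
            p = posterior_subjective(model, d)
            if max(p, 1.0 - p) >= threshold:
                docs.append(d)
                labels.append(1 if p >= 0.5 else 0)
                added += 1
            else:
                remaining.append(d)
        pool = remaining
        if added == 0:  # nothing confident enough was found, so stop
            break
    return train_nb(docs, labels), docs, labels

# Toy data (hypothetical). In the paper the seed set comes from lexicon
# information; here the seed labels are simply given by hand.
SEED_DOCS = [
    "i love this wonderful film".split(),
    "awful plot i hate it".split(),
    "the film runs two hours".split(),
    "the meeting starts at noon".split(),
]
SEED_LABELS = [1, 1, 0, 0]
UNLABELED = [
    "i love it".split(),
    "the meeting runs two hours".split(),
    "film".split(),
]

final_model, final_docs, final_labels = self_train(SEED_DOCS, SEED_LABELS, UNLABELED)
print(final_labels)
```

On this toy data the first two unlabeled sentences clear the threshold and join the training set, while the ambiguous single-word sentence stays in the unlabeled pool. The calibrated EM variant is not sketched here because the abstract does not describe the calibration step in enough detail.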

(Received October 25 2010)

(Revised April 27 2011)

(Accepted June 06 2011)

(Online publication August 16 2011)


† The authors thank Theresa Wilson for sharing annotations for the AMI corpus and helping with the processing of that data, and the three reviewers for their useful comments.