When I joined WhizBang!Labs, one of the most impressive sights was the room full of labelers - part-time workers whose job was to create labeled data. That data was used both to train and to test classifiers - supervised machine learning. One thing you rapidly learn is the value of good labeled data. Part of creating a labeling exercise is developing good criteria for determining the classes - the labels to be assigned to the data. Even with good criteria, it is remarkable how often a room full of people can disagree.
In a recent paper, 'Emotions from text: machine learning for text-based emotion predictions', Alm, Roth and Sproat gave a pair of labelers the task of identifying the emotion expressed in the text of children's fairy tales. They report that the labelers agreed between 45% and 64% of the time. In other words, when asked to determine whether a piece of text indicated ANGER, DISGUST, FEAR, HAPPINESS, SADNESS, SURPRISE, or no emotion, there was only moderate agreement.
Having multiple annotators, and the ability to take the intersection of their decisions, is part of an established process for producing a gold standard. Labels on which the annotators disagree are then reviewed and resolved. Emotion may be one of the hardest things to label. Defining emotions is tricky, and penetrating the textual fog that mixes direct and indirect emotional cues with metaphor (does 'I loved that movie' really mean I had an emotional relationship with it?) is valuable but challenging.
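To make the agreement numbers concrete, here is a small sketch of how two annotators' labels can be compared. The labels below are invented toy data, not from the paper; the sketch computes raw agreement (the percentage figure quoted above) and Cohen's kappa, a standard statistic that corrects raw agreement for the amount of agreement expected by chance.

```python
from collections import Counter

# Hypothetical labels from two annotators over the same ten passages
# (toy data for illustration, not from the Alm, Roth and Sproat study).
annotator_a = ["FEAR", "FEAR", "SADNESS", "NEUTRAL", "HAPPINESS",
               "ANGER", "NEUTRAL", "SURPRISE", "SADNESS", "FEAR"]
annotator_b = ["FEAR", "SADNESS", "SADNESS", "NEUTRAL", "HAPPINESS",
               "DISGUST", "NEUTRAL", "FEAR", "SADNESS", "FEAR"]

def raw_agreement(a, b):
    """Fraction of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = raw_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected chance agreement: for each label, the product of the two
    # annotators' marginal probabilities of using that label, summed.
    p_e = sum((counts_a[lbl] / n) * (counts_b[lbl] / n)
              for lbl in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

print(f"raw agreement: {raw_agreement(annotator_a, annotator_b):.2f}")
print(f"Cohen's kappa: {cohens_kappa(annotator_a, annotator_b):.2f}")
```

The gap between the two numbers is the point: annotators who pick labels at random from seven categories will still agree some of the time, so kappa gives a more honest picture of how well the labeling criteria are actually working.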