Motivation. The classical IR evaluation model was designed to evaluate the performance of an IR system with respect to just one interaction instance: the response that the system provides to a single query put to that system. The model has been extended in various ways, with varying effect. Test collections have used a surprisingly wide range of labeling criteria: topical relevance, home-page-for, key page, spam, opinionated, a-venue-I-would-go-to, novelty, and others. Cranfield assumes an atomic preference criterion: that is, an individual document's preference label is defined with respect to the document and topic only. Atomicity allows us to build test collections scalably, because documents can be labeled in a single pass. Other kinds of criteria for building test collections should be explored. For other atomic qualities we need to understand how to define them, how to develop labeling guidelines that are clear enough for separate sites to label items comparably, how to measure the consistency and reliability of those labels, and how to measure the impact of label disagreements. As research problems, these questions deserve more attention.

Although there have been serious attempts to design methods to evaluate system support for information search sessions, these have uniformly failed. There are several reasons for this failure. The atomic criterion of relevance, basic to the model, does not easily apply to evaluating the success of a whole session. Moreover, the presence of human beings, who bring varied intentions to the search session, make individual decisions during it, and differ in individual characteristics, has made comparing the performance of different systems with different people, as the classic model requires, seemingly impossible.

Extending the Cranfield model to full interactions is hard because it violates the atomicity criterion. To consider an interaction in which a user starts from different queries, encounters documents in different orders, and moves toward completion of the task along multiple paths, a test collection would need, at a minimum, to define the relevance of each document with respect to all documents already seen. Without constraining this within some sort of structure, an exponential number of relevance judgments would be needed: with a pool of n judged documents, each document could in principle require a separate judgment for every subset of the other n-1 documents already seen, up to n * 2^(n-1) judgments in all. Taking a further step and allowing the user's understanding of the task, and the criteria for its successful completion, to change during the interaction adds another exponent.
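To make the combinatorics concrete, the following sketch contrasts the two labeling schemes for a single topic. It is illustrative only: the four-document pool is hypothetical, and the counting simply enumerates judgment contexts rather than modeling any particular test collection.

```python
from itertools import combinations

# Hypothetical pool of judged documents for one topic.
docs = ["d1", "d2", "d3", "d4"]

# Atomic criterion: one label per (topic, document) pair,
# so the pool can be judged in a single pass.
atomic_judgments = len(docs)  # n judgments

# History-conditioned criterion: a document's label may depend on
# every possible set of documents the user has already seen.
conditional_judgments = 0
for d in docs:
    others = [o for o in docs if o != d]
    # All subsets of the other n-1 documents: 2^(n-1) contexts per document.
    for k in range(len(others) + 1):
        conditional_judgments += sum(1 for _ in combinations(others, k))

print(atomic_judgments)       # 4  -> n
print(conditional_judgments)  # 32 -> n * 2^(n-1)
```

Even at this toy scale, the history-conditioned scheme requires eight times as many judgments per document, and the gap doubles with every document added to the pool; letting the task definition itself evolve mid-session would multiply the contexts yet again.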