Motivation. Why does it matter for IR? IR systems often capture associations between entities and/or properties, and depending on the semantic connotations of such relationships they might reinforce current stereotypes about various groups of people, propagating and amplifying harm. For example, these associations may originate from the data used to train the ranking models, which may not provide enough coverage for all possible associations to be learned. Certain groups of individuals may be over- or under-represented in the data, which could be a reflection of broader societal disparities (e.g., unequal access to health care can result in unequal representation in health records) or of the types of people who are able to contribute content, including the rate at which these contributions are made (e.g., women tend to be over-represented in Instagram data but under-represented in StackOverflow data). Representation is also affected by the quality of the tools used to capture the data. For example, it is more difficult to perform facial recognition of dark-skinned people in video surveillance footage because of limitations in how cameras are calibrated. As a result, an image retrieval system might fail to properly identify images related to darker-skinned people, while an image assessment system might flag them more often for security interviews or scrutinize them in more detail.

What makes this specific to IR? Given the ubiquitous use of IR systems, broadly construed (e.g., search, recommendation, conversational agents), their impact, including negative impact, is potentially wide-ranging. For instance, research has shown that people place more trust in sources ranked higher in search results, yet the ranking criteria may rely on signals indicative of user satisfaction rather than on signals indicative of factual accuracy. For consequential search tasks, such as medical, educational, or financial ones, this raises concerns about the trade-offs between satisfying users and providing reliable information. The SIGIR community has a responsibility to address fairness, accountability, confidentiality, and transparency in all aspects of research and in the systems built in industry. Similar responsibility issues are addressed in related fields; however, there are issues specific to IR that stem from the characteristics of, and reliance on, document collections and from the often imprecise nature of search and recommendation tasks. IR has a strong history of using test collections during evaluation. As evaluation tools, test collections also have certain types of bias built in. For example, the people who construct topics and make relevance assessments are arguably not representative of the larger population. In some cases, they have not been representative of the type of users being modeled (e.g., having people who do not read blogs evaluate blogs). Evaluation measures are designed to optimize certain performance criteria and not others, and either implicitly or explicitly have built-in user models. Systems are then tested and tuned within this evaluation framework, further reinforcing and entrenching any existing biases. In building test collections, for example, bias can be mitigated by ensuring diversity in the sources of documents included and by recruiting people from diverse backgrounds to create topics.

What are examples of human, social, and economic impact?
Infrastructure and accessibility variations may introduce differential representation in training data. For example, research has shown that social media accounts with non-Western names are more likely to be flagged as fraudulent, arguably because the classifiers were trained predominantly on Western names. Bias can also be introduced by the interfaces and tools presented to users. For example, query autocompletion is a common feature of search systems that learns suggestions from users' past behavior; however, the people who type queries about particular topics often come from a specific segment of the population, and the intent behind their queries is often unclear. The query prefix “transgenders are ...”, for instance, results in offensive autocomplete suggestions such as “transgenders are freaks” and “transgenders are sick”.

Another motivation for this work is the growing concern about the understandability, explainability, and reliability of deep learning methods, including the complexity of their parameter spaces. These techniques are used in a variety of domains to assist with high-impact tasks, such as diagnosis in the medical domain and detecting threats and combating terrorism in the intelligence community. Many of the domain experts working in these fields are not satisfied with a simple answer; they want to know the reasoning and evidence behind the answer the system produces, because the decisions they make can have significant consequences. Moreover, the engineers who create these systems often do not understand which parts of the system are responsible for failures, and it can be difficult to trace the origins of errors in such complex parameter spaces. However, it is unclear how such explanations, evidence trails, and provenance might be communicated to the various user groups, and how such communication might change behaviors and the quality, quantity, and nature of human-computer interaction.

We, the IR community, should take the initiative before others do, in the face of changing legal frameworks. For example, under the European General Data Protection Regulation, individuals have a right to erasure of personal information and a right to explanation. IR systems need to incorporate these rights: an indexing scheme needs to be able to delete information, and search results may require explanation.
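To make the indexing requirement concrete, the following is a minimal sketch, not taken from any existing system, of a toy inverted index that supports erasing a document so it can no longer be retrieved. The class and method names are illustrative assumptions; a production index would also need to handle cached results, replicas, and logs.

```python
# Minimal sketch (illustrative only): a toy inverted index with erasure support,
# showing what "an indexing scheme needs to be able to delete information" can mean.
from collections import defaultdict


class ErasableIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of document ids
        self.documents = {}                # document id -> original text

    def add(self, doc_id, text):
        """Index a document by its whitespace-separated terms."""
        self.documents[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def erase(self, doc_id):
        """Remove all traces of a document, e.g. to honour an erasure request."""
        text = self.documents.pop(doc_id, "")
        for term in set(text.lower().split()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:     # drop empty posting lists entirely
                del self.postings[term]

    def search(self, term):
        return self.postings.get(term.lower(), set())


index = ErasableIndex()
index.add("d1", "patient record with personal information")
index.add("d2", "public health report")
index.erase("d1")                          # erasure request for d1
print(index.search("personal"))            # -> set(); d1 is no longer retrievable
```

The design point is simply that deletion must reach the posting lists themselves, not just a document store, otherwise erased content can still surface in results.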