Test Corpus. We collected names from two publicly available sources. The first is the Death Master File, published by the Social Security Administration, which contains the names of about 77 million deceased holders of social security numbers¹. Although limited to the United States, the data source is large enough to contain names from a variety of linguistic and cultural origins. The second source is the Mémoire des hommes, published by the French government, which lists the names of about 1.3 million deceased soldiers from 20th-century wars, including Indochina and North Africa². As such, it contains not only French names but also Southeast Asian and Francophone-transliterated Arabic names. Using a commercial name culture classification tool, 70,000 names were chosen with a stratified cultural distribution, including Anglo, Arabic, Hispanic, Chinese, Korean, Russian, Southwest Asian (Farsi, Afghani, Pakistani), French, German, Indian, Japanese, and Vietnamese. Additionally, we manually created 1,146 variants on 404 (about 0.6%) of the base records, averaging 2.8 variants per record. Because it is infeasible to adjudicate the results of matching the entire list against itself, we chose a subset of 700 records as queries. The queries come from two groups: the 404 “base” records and randomly selected records. Of these 700 queries, which were used in a larger evaluation, 100 were randomly selected for this study. Performance is evaluated using a balanced F-score (F1).

¹ ▇▇▇▇://▇▇▇.▇▇▇▇.▇▇▇/products/ssa-dmf.asp. We would like to acknowledge ▇▇▇▇▇▇▇▇▇ ▇▇▇▇ for identifying this data source.
² ▇▇▇▇://▇▇▇.▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇.▇▇▇.▇▇▇▇▇▇▇.▇▇▇▇.▇▇/
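The query selection described above can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `select_queries`, the integer record IDs, the fixed seeds, and the assumption that all 404 base records are included (with the remaining 296 queries drawn uniformly at random from the other records) are hypothetical choices made only for illustration.

```python
import random


def select_queries(record_ids, base_ids, n_queries=700, n_study=100, seed=0):
    """Hypothetical helper: pick evaluation queries from a corpus of record IDs."""
    rng = random.Random(seed)
    base = sorted(set(base_ids))
    base_set = set(base)
    # Candidates for the random group: every record that is not a base record.
    others = [r for r in record_ids if r not in base_set]
    # Assumption: every base record is a query; the rest are drawn at random.
    extra = rng.sample(others, n_queries - len(base))
    queries = base + extra
    # The study subset is a uniform random draw from the full query set.
    study = rng.sample(queries, n_study)
    return queries, study


# Stand-in corpus: integer IDs in place of the 70,000 sampled names.
corpus = list(range(70_000))
base_records = random.Random(1).sample(corpus, 404)
queries, study = select_queries(corpus, base_records)

print(len(queries), len(study))  # 700 100
# Sanity check of the figures quoted in the text.
print(round(404 / 70_000 * 100, 1), round(1146 / 404, 1))  # 0.6 2.8
```

The last line only confirms the reported arithmetic: 404 of 70,000 records is about 0.6%, and 1,146 variants over 404 records averages about 2.8 per record.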