Methods and Materials. For each of the 18,820 pairs of the ad-hoc retrieval runs of TREC 3, 5–8, we computed the two-sided statistical significance (p-value) of the difference in the pair's mean average precision using each of three tests: the randomization test, the shifted bootstrap test, and Student's paired t-test. Both the randomization and bootstrap tests are distribution-free. Space limitations prevent us from explaining the details of each of these well-known tests. For both the randomization and bootstrap tests, we drew 100,000 samples. For each pair of runs, we sampled topics without replacement to produce runs with 10, 20, 30, and 40 topics.

To compare significance tests, we computed the root mean square error between each test's p-values and each other test's p-values. The root mean square error is:

$$\mathrm{RMSE} = \left[ \frac{1}{N} \sum_{i=1}^{N} (E_i - O_i)^2 \right]^{1/2}$$

where $E_i$ is the estimated p-value given by one test and $O_i$ is the other test's p-value for the $i$-th of the $N$ pairs of runs.

| Number of Topics | 50 | 40 | 30 | 20 | 10 |
|---|---|---|---|---|---|
| Pairs of TREC runs with p-values ≥ 0.0001 | | | | | |
| rand. vs. t-test | 0.007 | 0.009 | 0.011 | 0.018 | 0.037 |
| boot. vs. t-test | 0.007 | 0.009 | 0.011 | 0.017 | 0.035 |
| boot. vs. rand. | 0.011 | 0.014 | 0.017 | 0.026 | 0.051 |
| Run pairs with p-value p such that 0.0001 < p < 0.5 | | | | | |
| rand. vs. t-test | 0.005 | 0.006 | 0.008 | 0.012 | 0.027 |
| boot. vs. t-test | 0.008 | 0.010 | 0.013 | 0.020 | 0.041 |
| boot. vs. rand. | 0.010 | 0.013 | 0.016 | 0.024 | 0.047 |

Table 1: The root mean square error among the p-values of the randomization (rand.), t-test, and bootstrap (boot.) tests for pairs of TREC runs such that all three tests agree that the p-value p is ≥ 0.0001 (top) and 0.0001 < p < 0.5 (bottom).

Copyright is held by the author/owner(s). ACM 978-1-60558-483-6/09/07.
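The root mean square error comparison and the randomization test can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `rmse` and `randomization_p_value` are hypothetical, and the sign-flip formulation is one common way to implement a paired randomization test, assumed here because the text does not give the details. The default of 10,000 samples is chosen for speed; the study used 100,000.

```python
import math
import random


def rmse(estimated, other):
    """Root mean square error between two equal-length lists of p-values."""
    if len(estimated) != len(other):
        raise ValueError("p-value lists must have the same length")
    n = len(estimated)
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(estimated, other)) / n)


def randomization_p_value(scores_a, scores_b, samples=10_000, seed=0):
    """Two-sided paired randomization (sign-flip) test on the mean difference.

    scores_a and scores_b hold per-topic average precision for two runs
    (assumed layout: same topics, same order).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(samples):
        # Under the null hypothesis the two run labels are exchangeable on
        # each topic, so each difference keeps or flips its sign with
        # probability 1/2.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(permuted) >= observed:
            extreme += 1
    return extreme / samples
```

Given one list of p-values per test, one entry per run pair, a call such as `rmse(rand_p, ttest_p)` would yield a single cell of the kind reported in Table 1; `rand_p` and `ttest_p` are placeholder names for those lists.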