Results and Discussion. Table 1 (top) shows the root mean square error (RMSE) between the three tests for different numbers of topics. All three tests largely agree with each other, but the agreement decreases as the sample size (number of topics) decreases. In line with the results found for 50 topics, the randomization and bootstrap tests agree more with the t-test than with each other.

We looked at pairwise scatterplots of the three tests at the different numbers of topics. While there is some disagreement among the tests at large p-values, i.e. those greater than 0.5, none of the tests would predict such a run pair to have a significant difference. More interesting to us is the behavior of the tests for run pairs with lower p-values.

Table 1 (bottom) shows the RMSE among the three tests for run pairs that all three tests agreed had a p-value greater than 0.0001 and less than 0.5. In contrast to all pairs with p-values ≥ 0.0001 (Table 1 top), these run pairs are of more importance to the IR researcher, since they are the runs that require a statistical test to judge the significance of the performance difference. For these run pairs, the randomization test and the t-test are much more in agreement with each other than the bootstrap is with either of the other two tests.

Looking at scatterplots, we found that the bootstrap tracks the t-test very well but shows a systematic bias toward p-values smaller than the t-test's. As the number of topics decreases, this bias becomes more pronounced. Figure 1 shows a pairwise scatterplot of the three tests when the number of topics is 10. The randomization test also tends to produce smaller p-values than the t-test for run pairs where the t-test estimated a p-value smaller than 0.1, but at the same time it produces some p-values greater than the t-test's. As Figure 1 shows, the bootstrap consistently gives smaller p-values than the t-test for these smaller p-values.

While the bootstrap and the randomization test disagree with each other more than with the t-test, Figure 1 shows that for a low number of topics, the randomization test shows less noise in its agreement with the bootstrap compared to the t-test for small p-values.

Figure 1: A pairwise comparison of the p-values less than 0.25 produced by the randomization, t-test, and bootstrap tests for pairs of TREC runs with only 10 topics. The small number of topics highlights the differences between the three tests.
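To make the comparison concrete, the following is a minimal sketch of how two of the tests and the RMSE between their p-values could be computed on per-topic scores. It is illustrative only: this section does not specify the exact test formulations, so the sign-flip randomization test, the shift-method bootstrap, the function names, the number of trials, and the sample scores below are all assumptions. The t-test is omitted to keep the sketch dependent on the standard library alone.

```python
import math
import random


def randomization_test(a, b, trials=10000, seed=0):
    """Two-sided paired randomization test (assumed sign-flip form).

    a, b: per-topic scores of two runs on the same topics.
    Returns the fraction of random sign assignments whose absolute
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the two run labels are exchangeable
        # per topic, so each difference is equally likely to flip sign.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed:
            hits += 1
    return hits / trials


def bootstrap_shift_test(a, b, trials=10000, seed=0):
    """Two-sided paired bootstrap test (assumed shift-method form).

    Resamples the per-topic differences with replacement and shifts
    the resampled means so that their distribution is centered on zero.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = sum(diffs) / n
    hits = 0
    for _ in range(trials):
        sample = [rng.choice(diffs) for _ in range(n)]
        shifted = sum(sample) / n - observed
        if abs(shifted) >= abs(observed):
            hits += 1
    return hits / trials


def rmse(p, q):
    """Root mean square error between two equal-length lists of p-values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)) / len(p))


if __name__ == "__main__":
    # Hypothetical per-topic scores for two runs over 10 topics.
    run_a = [0.30, 0.42, 0.51, 0.38, 0.45, 0.60, 0.33, 0.49, 0.55, 0.41]
    run_b = [0.28, 0.45, 0.47, 0.36, 0.40, 0.61, 0.30, 0.44, 0.56, 0.39]
    p_rand = randomization_test(run_a, run_b)
    p_boot = bootstrap_shift_test(run_a, run_b)
    print("randomization p-value:", p_rand)
    print("bootstrap p-value:", p_boot)
    # With many run pairs, collect one p-value per pair and per test,
    # then compare the two lists of p-values with rmse().
    print("RMSE for this single pair:", rmse([p_rand], [p_boot]))
```

In a study like the one described, each test would be applied to every run pair, and the RMSE would be taken over the resulting lists of p-values, once for all qualifying pairs and once restricted to a chosen p-value range.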


