Flickr-8k definition
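The two agreement statistics reported for this dataset, Fleiss' κ and the average pairwise Gamma, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy ratings are assumptions made for the example, and the Gamma here is the standard concordant-minus-discordant ratio with tied pairs ignored.

```python
from itertools import combinations


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`: one list of labels per item, one label per judge."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: number of judges who assigned categories[j] to item i
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


def gamma(x, y):
    """Gamma rank correlation: (C - D) / (C + D), tied pairs ignored.

    Raises ZeroDivisionError if every pair is tied.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def mean_pairwise_gamma(ratings):
    """Average Gamma over all pairs of judges."""
    judges = list(zip(*ratings))  # one tuple of scores per judge
    pairs = list(combinations(judges, 2))
    return sum(gamma(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data (NOT from Flickr-8k): the second judge scores
# one step higher than the first, capped at the top of the 1-4 scale.
toy = [[1, 2], [2, 3], [3, 4], [4, 4]]
print(fleiss_kappa(toy, categories=[1, 2, 3, 4]))
print(mean_pairwise_gamma(toy))
```

On the toy data the judges almost never pick the same category, so κ comes out slightly negative, yet they order the items identically, so Gamma is 1.0. This is the same pattern of low absolute agreement and high relative consistency that the section describes.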
Flickr-8k. The Flickr-8k dataset contains quality judgements for 5,822 sentences (▇▇▇▇▇▇▇ and ▇▇▇▇▇▇, 2014). Each sentence is a description of an image. The annotation was carried out by three human experts, who judged the semantic correctness of each sentence on a scale from 1 to 4. Because we have no information about how the data were collected, we plotted the distribution of the categories used by the judges in order to decide which kind of analysis to carry out on the Flickr-8k dataset. Figure 3 suggests that the data are not normally distributed, so we opted for nonparametric statistics. As in the previous case, we used ▇▇▇▇▇▇▇ and ▇▇▇▇▇▇▇’s Gamma and ▇▇▇▇▇▇’ κ to carry out our analysis. For ▇▇▇▇▇▇▇ and ▇▇▇▇▇▇▇’s Gamma, we report the average of the pairwise measures between the annotators. This method is suggested by ▇▇▇▇▇▇ and ▇▇▇▇▇▇▇▇▇ (1988) for the ▇▇▇▇▇▇▇ τ correlation coefficient, which is a variant of ▇▇▇▇▇▇▇ and ▇▇▇▇▇▇▇’s Gamma.

Figure 3: Distribution of the categories used by the judges in the Flickr-8k dataset.

The measurements give a Fleiss’ κ of 0.52 and a Gamma of 0.98. Following Krippendorff’s interpretation of IAA, the annotation has to be considered unreliable. However, the annotation achieves a very high correlation, which suggests a high relative consistency between the judges. Indeed, when they disagree, judge 2 ranks systematically higher than judge 1, and judge 3 ranks systematically higher than judge