Common use of Data Set Clause in Contracts

Data Set. In this paper, we present the development of a training data set for Dutch Named Entity Recognition (NER) in the archaeology domain. The data set was created to meet a pressing need for semantic search within archaeology, so that archaeologists can find structured information in collections of Dutch excavation reports, which currently total around 60,000 (658 million words) and are growing rapidly. NER is needed to support this search task. We created rigorous annotation guidelines in an iterative process, then instructed five archaeology students to annotate a number of documents. The resulting data set contains roughly 31,000 annotations across six entity types (artefact, time period, place, context, species, and material). The inter-annotator agreement (IAA) is 0.95, and a machine learning model trained on this data reached an F1 score of 0.70, compared with 0.51 for a model trained on a data set created in prior work. This indicates that the data is of high quality and can confidently be used to train NER classifiers.
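For readers unfamiliar with the F1 score quoted above, the following is a minimal sketch of how an entity-level F1 could be computed. It assumes exact-match scoring, where a predicted entity counts as correct only if its span and type both match a gold annotation. The excerpt does not state which matching scheme the authors used, and the function name `entity_f1`, the `(start, end, type)` tuple layout, and the sample annotations are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_token, end_token, entity_type).
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Exact-match F1: harmonic mean of precision and recall."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        # Covers empty inputs and avoids division by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up annotations using entity types named in the abstract.
gold = {(0, 1, "artefact"), (4, 5, "time period"), (8, 8, "material")}
predicted = {(0, 1, "artefact"), (4, 5, "place"), (8, 8, "material")}

# Two of three predictions match, so precision and recall are both 2/3.
score = entity_f1(gold, predicted)
print(f"F1 = {score:.2f}")
```

Under this scheme, a prediction with the right span but the wrong type (the second entity here) counts as both a false positive and a false negative, which is why exact-match F1 tends to be a strict measure.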

Appears in 2 contracts

Sources: License Agreement, Doctoral Thesis