Data set Collection Sample Clauses
Data set Collection. From the total available corpus (70k documents), we currently have access to ~60,000 excavation reports and related documents, such as appendices, drawings and maps. These texts have been gathered by DANS (Digital Archiving and Networked Services) in the Netherlands over the past 20 years. We received the documents from DANS as PDF files and used the pdftotext tool (Glyph & Cog LLC, 1996) to convert them to plain text. The resulting data set contains 30,152,318 lines and 657,808,600 words (as counted by the command-line tool “wc”).

The texts are quite diverse. The dates of publication span decades, and the earliest reports were scanned and OCR'd from hard copies created in the 1980s. A second kind of temporal variation is the age of the artefacts found, which ranges from 200,000 BC to the present. The type of research also differs greatly between reports: some describe a short desk evaluation of a small area without any fieldwork, while others detail huge excavations over multiple years with detailed analysis by a team of specialists.

To obtain a representative sample across all these ranges, random sampling would not be ideal, so we instead selected documents manually, taking into account the variation described above. We selected a total of 15 documents as annotation candidates (~42,000 tokens). To calculate the inter-annotator agreement (IAA) and evaluate the annotation guidelines, we manually selected roughly 100 sentences from these documents, covering all the entity types (Table 3.1, explained below) and specific difficult cases, as a validation set annotated by all annotators.

| Entity type | Description | Examples |
| --- | --- | --- |
| Artefact | An archaeological object found in the ground. | Axe, pot, stake, arrow head, coin |
| Time Period | A defined (archaeological) period in time. | Middle Ages, Neolithic, 500 BC, 4000 BP |
| Location | A placename or (part of) an address. | Amsterdam, ▇▇▇▇▇- ▇▇▇▇▇▇ ▇, ▇▇▇▇▇▇▇▇▇▇ |
| Context | An anthropogenic, definable part of a stratigraphy. Something that can contain Artefacts. | Rubbish pit, burial mound, stake hole |
| Material | The material an Artefact is made of. | Bronze, wood, flint, glass |
| Species | A species’ name (in Latin or Dutch). | Cow, Corvus Corax, oak |

Table 3.1: Descriptions and examples for each entity type. Examples are translated from Dutch.
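The conversion and counting steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: it assumes the pdftotext executable is installed and on the PATH, and the function names and the one-text-file-per-report directory layout are hypothetical. The counts mimic what "wc" reports (newline characters for lines, whitespace-delimited tokens for words).

```python
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path: Path, txt_path: Path) -> None:
    """Convert one PDF to plain text by calling the pdftotext command-line tool."""
    # Assumption: pdftotext is on PATH; no extra flags are passed here.
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)


def count_lines_and_words(text: str) -> tuple[int, int]:
    """Approximate `wc -l` and `wc -w` for an already decoded string."""
    lines = text.count("\n")   # wc -l counts newline characters
    words = len(text.split())  # wc -w counts whitespace-delimited tokens
    return lines, words


def corpus_totals(txt_dir: Path) -> tuple[int, int]:
    """Sum line and word counts over all .txt files in a (hypothetical) directory."""
    total_lines = total_words = 0
    for txt_file in sorted(txt_dir.glob("*.txt")):
        # OCR'd scans may contain invalid bytes, so decoding errors are replaced.
        text = txt_file.read_text(encoding="utf-8", errors="replace")
        lines, words = count_lines_and_words(text)
        total_lines += lines
        total_words += words
    return total_lines, total_words
```

Counting on decoded text can differ slightly from "wc" on raw bytes when files contain encoding errors, so totals from this sketch should be read as approximate.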
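The text does not say which agreement measure is used for the IAA on the validation set. As one common option, and purely as an assumption, the sketch below computes Cohen's kappa between two annotators over aligned token labels and averages it over all annotator pairs. The label strings and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same sequence of tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(annotations: dict[str, list[str]]) -> float:
    """Average Cohen's kappa over every pair of annotators."""
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)
```

For entity annotation, token-level kappa is dominated by the frequent "outside" label, so span-level F1 between annotators is often reported instead; either choice would need to be checked against the authors' own evaluation.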
