Common use of Source Data Clause in Contracts

Source Data. We use all documents in the ‘archaeology’ category of the 2016 version of the Data Archiving and Networked Services (DANS) repository, one of the largest Dutch e-depots. This data set consists of just over 65,000 files, all in PDF format. Judging by document titles, the files include (excavation) reports, publications, separate appendices and figures, letters, and metadata. Although we have not statistically tested the representativeness of this data set, it covers almost all the output of commercial archaeology units from roughly the last 30 years, spanning all time periods, site types and report types. Reports are often split into multiple PDFs; for longer reports, one file per chapter and appendix is common. For our research, ▇▇▇▇▇ already provides a collection in which all files have been converted to both XML and raw text format, which allows for the use of information retrieval and text classification. In this research, we only use the raw text files, which were created using the pdftotext software (Glyph & Cog LLC, 1996).

The conversion of the PDF files to the required text format introduced a lot of noise. This includes headers, page numbering and various indices appearing at random positions in the text. The main culprits are tables and figures, which are no longer recognisable after conversion. Brandsen et al. (2019) estimate that around 15% of all documents are OCRed, a process likely to introduce noise even before the PDF-to-text conversion. Luckily, this percentage will only decrease as more and more born-digital documents are added over time.
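
Two of the noise types mentioned above, stray page numbering and words split by a hyphen at a line break, can be illustrated with a minimal Python sketch. The source does not describe any cleaning step, so everything here is an assumption made for illustration: the function names, the regular expressions, and the choice to treat lines such as "12" or "Pagina 13" as page numbers are all hypothetical.

```python
import re

# Hypothetical heuristics for illustration only; not taken from the source.
# A line counts as a page number if it holds only 1-4 digits,
# optionally preceded by "page" or the Dutch "pagina".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+|pagina\s+)?\d{1,4}\s*$", re.IGNORECASE)

# A word character, a hyphen and a space, followed by a lowercase letter:
# the typical trace of a word hyphenated across a line break.
SPLIT_WORD = re.compile(r"(\w)- (?=[a-z])")


def is_page_number(line: str) -> bool:
    """Return True if the line looks like a bare page number."""
    return bool(PAGE_NUMBER.match(line))


def noise_ratio(text: str) -> float:
    """Fraction of non-blank lines that look like page numbers."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(is_page_number(ln) for ln in lines) / len(lines)


def rejoin_split_words(text: str) -> str:
    """Rejoin words such as 'intro- duced' into 'introduced'."""
    return SPLIT_WORD.sub(r"\1", text)
```

The rejoining heuristic is deliberately crude: it would also merge legitimate constructions such as "pre- and post-excavation", so any real use would need a dictionary check or similar safeguard.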

Sources: License Agreement Concerning Inclusion of Doctoral Thesis in the Institutional Repository of the University of Leiden, Doctoral Thesis