Named Entity Recognition. NER is the process of finding different categories of named entities (or concepts) in text. Quite often, the categories of entities are persons, organisations, locations, time periods and quantities, as defined in CoNLL-2002, the most used NER benchmark (▇▇▇▇▇ ▇▇▇ Sang, 2002). For archaeology, these entities are not as relevant, with the exception of time periods and locations. In this study, we focus on six entity types; Table 3.1 in the next chapter gives more formal definitions of these entities and some examples.

But why is NER relevant for searching in archaeological texts, and why is a standard free text search not sufficient? In one of the previous sections, we already mentioned polysemy and synonymy, which are the main reasons why NER can help us find relevant documents.

Polysemy is the phenomenon of one word having multiple meanings. An example is the word “flint”, which can mean the material flint, or a person with the surname ▇▇▇▇▇. In Dutch archaeology, a good example is “Swifterbant”, which can mean an excavation event, a type of pottery, a time period, or a place in The Netherlands. A standard free text search would return results about all of these meanings. If we know which meaning a user is looking for, and we can detect the meaning in the documents, then we can return more relevant results. We can use NER to disambiguate between these meanings in the documents.

Synonymy is the other way around: one concept that can be described by many different words. An example is the location Den ▇▇▇▇, which can also be written as ’s Gravenhage and The Hague. While synonymy occurs in all six entity types, it is only a major challenge for time periods. There are countless ways in which we can describe e.g. the Neolithic, or periods and years within the Neolithic.
To name a few examples:

• 5693 ± 26 BP (a carbon dating date)

But when an archaeologist searches for the Neolithic, ideally they would want all mentions of a date or period within the Neolithic to be returned, and not just the documents that literally contain the word “Neolithic”. To do this, we first need to find all mentions of time periods in the reports, which is where we can use NER. Once we have a list of time periods for each document, we can translate these mentions to year ranges, using a thesaurus for named time periods and a rule-based approach for dates and years. So we can translate “Funnelbeaker culture” to the year range -4350 to -2700, and “4th to 3rd millennium B.C.” into the range -4000 to -2000. Users can then search on specific date ranges, or we can translate their query of “Neolithic” to a year range, and find all mentions of time spans that fall within that range. This way we can find more relevant results in the document collection.

Another concept that warrants explaining in the context of NER is the token. A token is an instance of a sequence of characters that are grouped together as a useful unit for processing (▇▇▇▇▇▇▇ et al., 2008). Tokens are similar to words, and a token often is a word, but not always. We can illustrate this with the following sentence: “We didn’t find any ‘Swifterbant’ pottery in pit 1, 2 and 3.”. When this sentence is converted into tokens, in a process called tokenisation, we find the following tokens, here separated by spaces:

We didn ’ t find any ‘ Swifterbant ’ pottery in pit 1 , 2 and 3 .

As we can see, most of these tokens are indeed words, but punctuation marks have also become individual tokens, and “didn’t” has been converted to three separate tokens. This tokenisation process is important as it separates noise (such as the quotes around Swifterbant) from the words and turns sentences into chunks that can be processed further. Also, specifically for NER, predictions are made at the token level.
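As a rough illustration of the tokenisation step, the sketch below splits the example sentence with a single regular expression. The rule used here (a token is either a run of word characters or one punctuation character) and the names TOKEN_PATTERN and tokenise are assumptions made for this example; the tokeniser actually used in this study may follow different rules.

```python
import re

# Assumed rule for this sketch: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenise(sentence: str) -> list[str]:
    """Split a sentence into tokens using the assumed rule above."""
    return TOKEN_PATTERN.findall(sentence)


sentence = "We didn’t find any ‘Swifterbant’ pottery in pit 1, 2 and 3."
tokens = tokenise(sentence)

# Print the tokens separated by spaces, as in the text.
print(" ".join(tokens))
```

With this rule the quotes, the apostrophe, the commas and the full stop each become a token of their own, which is why “didn’t” ends up as three tokens.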
Because a prediction is made for each individual token, NER training data and predictions are generally stored in the Beginning, Inside, Outside (BIO) format (▇▇▇▇▇▇▇ & ▇▇▇▇▇▇, 1999). This format is most commonly used for sequence labelling tasks such as NER. The file format is a simple text file, with each token on its own line, followed by a space and the label. Sentence boundaries are denoted by a blank line. An example is shown below:

We O
found O
a O
pottery B-ART
shard I-ART
from O
the O
Neolithic B-PER
. O

Here we see a sentence where ‘pottery’ has been labelled as the start of an Artefact entity, ‘shard’ as inside an Artefact entity, and ‘Neolithic’ as the start of a Time Period entity. The other tokens are labelled O, for Outside an entity.

Related to tokens are terms, which are the tokens that are included in a search engine’s index. Quite often, not all tokens are included in an index. For example, very common words such as ‘the’, ‘and’ and ‘of’ (also called stop words) are removed, as they are not useful for searching. Punctuation is also commonly not indexed.

Also worth mentioning here are Part Of Speech (POS) tags. A Part Of Speech is a category of words that have similar grammatical properties, such as noun, verb and adjective. POS tags can be used as a feature in NER, and as such are often saved together with the BIO tags in a file.
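To make the BIO format and the earlier year-range search concrete, the sketch below reads BIO-formatted text, groups the labels back into entities, and looks up Time Period entities in a small thesaurus. It is a minimal sketch under stated assumptions, not the implementation used in this study: the function names, the second example sentence, the one-entry thesaurus and the query range are all made up for illustration. Only the Funnelbeaker year range is taken from the text.

```python
from typing import Iterator

Sentence = list[tuple[str, str]]  # (token, label) pairs


def read_bio(text: str) -> Iterator[Sentence]:
    """Yield sentences from BIO text; a blank line separates sentences."""
    for block in text.strip().split("\n\n"):
        sentence = [tuple(line.rsplit(" ", 1))
                    for line in block.splitlines() if line.strip()]
        if sentence:
            yield sentence


def extract_entities(sentence: Sentence) -> list[tuple[str, str]]:
    """Group B-/I- labelled tokens into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for token, label in sentence:
        if label.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current.append(token)
        else:
            # An O label, or an I- label without a matching B-, ends the entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


def within(span: tuple[int, int], query: tuple[int, int]) -> bool:
    """True if the span lies entirely inside the query range."""
    return query[0] <= span[0] and span[1] <= query[1]


# The first sentence is the example from the text; the second is made up.
BIO_TEXT = """\
We O
found O
a O
pottery B-ART
shard I-ART
from O
the O
Neolithic B-PER
. O

Funnelbeaker B-PER
culture I-PER
pottery B-ART
was O
absent O
. O
"""

# Hypothetical one-entry thesaurus; the range is the one quoted in the text.
PERIOD_THESAURUS = {"funnelbeaker culture": (-4350, -2700)}
# Arbitrary query range for illustration, not an authoritative dating.
QUERY_RANGE = (-5000, -2000)

sentences = list(read_bio(BIO_TEXT))
entities_per_sentence = [extract_entities(s) for s in sentences]

hits = []
for entities in entities_per_sentence:
    for text, entity_type in entities:
        span = PERIOD_THESAURUS.get(text.lower())
        if entity_type == "PER" and span is not None and within(span, QUERY_RANGE):
            hits.append((text, span))

print(entities_per_sentence)
print(hits)
```

In this sketch a period mention that is missing from the thesaurus, such as ‘Neolithic’ here, is simply skipped, which shows how much the search depends on thesaurus coverage.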