The Corpus Sample Clauses

The Corpus. For the remaining packages of the lexicon, automatic contextual disambiguation is attempted using a parallel corpus. The goal is to find conceptual contexts in the corpus that allow translation alternatives to be disambiguated.
The Corpus. A corpus of manually written summaries of texts has been assembled from materials provided to participants in the Document Understanding Conferences, which have been held annually since 2001. Most summaries in the corpus are abstracts, written by human readers of the source document to best express its content without restriction in any manner save length (words or characters). One method of performing automatic summarization is to construct the desired amount of output by concatenating representative sentences from the source document, which reduces the task to one of determining most adequately what ‘representative’ means. Such summaries are called extracts. In 2002, recognizing that many participants summarize by extraction, NIST produced versions of documents divided into individual sentences and asked its author volunteers to compose their summaries similarly. Because we use a sentence-extraction technique in our summarization system, this data is of particular interest to us. It is not included in the corpus being treated here and will be discussed in a separate paper. The XXX xxxxxx contains 11,867 files organized in a three-level hierarchy of directories totaling 62MB. The top level identifies the source year and exists simply to avoid the name collision which occurs when different years use same-named subdirectories.
The middle 291 directories identify the document clusters; DUC reuses collections of newswire stories assembled for the TREC and TDT research initiatives which report on a common topic or theme. Directories on the lowest level contain tagged and untagged versions of 2,781 individual source documents, and between one and five summaries of each, 9,086 in total. In most cases the document involved is just that: a single story originally published in a newspaper. However, 552 directories, approximately 20% of the corpus, represent multi-document summaries, ones which the author has based on all the files in a cluster of related documents. For these summaries we constructed...

                   DOCUMENTS                         SUMMARIES
Year      10    50   100   200  Total      10    50   100   200  Total   D : S
2001       -    28   316    56    400       -    84   949   165   1198   1 : 3
2002      59    59   626    59    803     116   116  1228   116   1576   1 : 2
2003     624     -    90     -    714    2496     -   360     -   2856   1 : 4
2004     740     -   124     -    864    2960     -   496     -   3455   1 : 4
Total   1423    87  1156   115   2781    5572   200  3033   281   9086   1 : 3

Table 1: Number of Documents and Summaries by Size and by Year with Document : Summary Ratios

1 This work will also be presented at the ACL Text Summarization Workshop in Barcelona, July 25-26, 2004
The Corpus. The initial data used to examine the issues mentioned above are taken from previous accounts of conjunct agreement in both English and Serbian. The data from English are provided by Lorimor (2007), among others, and the initial data from Serbian are found in Xxxxxxx (1983), Xxxxxxxxxx (1979), and Xxxxxxxx (2009). After these works were examined and the basic problems identified, a survey was conducted to look into the basic patterns of agreement that speakers of Serbian employ in active production. The survey was completed by 60 participants, all native speakers of Serbian. The speakers were asked to do a production task, supplying the missing agreement information on the verb based on the conjoined subjects, whose features were varied. The results of this survey provide the material on which the theoretical model of conjunct agreement developed in the thesis is based. The thesis is organized as follows. Section 2 gives a detailed introduction to the process of agreement and the role of features in that process, as well as the nature of features themselves. Section 3 focuses on agreement with conjoined subjects and provides a brief overview of agreement patterns with conjoined subjects in English and Serbian. Section 4 explains the mechanism of agreement and the structure of the coordinate phrase, so as to help the reader understand the syntactic mechanisms of conjunct agreement presented in the following sections. Section 5 presents previous syntactic accounts of conjunct agreement, which provide a basis for the analysis of the data gathered in the research. Section 6 identifies the basic problems tackled by the research and then presents the results of the research together with their analysis. Section 7 contains concluding remarks.
The Corpus. In this chapter, the generation of the Covid-themed Tweets Dataset is discussed in detail. The source of data (Section 3.1), the mechanism and word choice for tweet scraping (Section 3.1), the rationale for choosing data produced in the twelve-day span (Section 3.2), the preliminary filtering process (Section 3.3), and string removal (Section 3.4) are elaborated to demonstrate our dataset’s integrity. To ensure the quality of the data, we additionally apply quality assurance procedures (Section 3.5), in the hope of convincing readers that the Covid-themed Tweets Dataset can serve as a valid and rich resource for event detection research in the NLP community.
The Corpus. For our corpus study we extracted data from the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus).4 The CGN is based on roughly 1000 hours of contemporary Dutch from the Netherlands and Flanders. The speech is composed of different genres, ranging from face-to-face and telephone conversations to interviews, debates, radio shows and read-aloud books. The speech files, amounting to roughly 10M words, have been orthographically transcribed, lemmatized, and tagged for part-of-speech information. Moreover, about 10% of the corpus has been syntactically annotated (van der Wouden et al. 2002). From this syntactically annotated part of the corpus we have extracted all prepositional phrases. This amounted to 57,287 PP instances containing 139 unique adpositions and 12,947 unique heads in the adpositional complements. From this set we extracted all heads of the adpositional complements with a frequency higher than 10 occurrences. These 766 unique words were subsequently annotated by the two authors for their animacy using the coding scheme of Xxxxxx et al. (2004), which provides a 9-way classification. Where possible, disagreement was resolved by discussion. Of these 766 words, 154 were left out due to unresolved disagreement between the two annotators and 53 because they contained context-dependent elements,

4 xxxx://xxxxx.xxx.xxx.xx/cgn/ehome.htm
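
The frequency cut-off described in this excerpt (keeping only heads that occur more than 10 times) can be illustrated with a minimal sketch. This is not the authors' code: the function name `frequent_heads`, the `min_count` parameter, and the toy word list are all assumptions made for illustration, and the toy data is invented rather than drawn from the CGN.

```python
from collections import Counter


def frequent_heads(heads, min_count=10):
    """Return the heads that occur strictly more than min_count times.

    `heads` is any iterable of head words, one entry per PP instance.
    The strict comparison mirrors "a frequency higher than 10".
    """
    counts = Counter(heads)
    return sorted(head for head, n in counts.items() if n > min_count)


# Toy data, invented for illustration only (not taken from the CGN).
toy_heads = ["huis"] * 12 + ["man"] * 11 + ["tafel"] * 10 + ["fiets"] * 3

print(frequent_heads(toy_heads))
```

A head with exactly 10 occurrences is dropped here because the text says "higher than 10"; if the intended threshold were inclusive, the comparison would be `>=` instead.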
