Dialogue Extraction Sample Clauses
Dialogue Extraction. To build our own dataset, we develop a statistical algorithm to extract sub-scenes from each scene. Sub-scenes are defined as extracts from a whole scene. The motivation for extracting sub-scenes is to have annotators work on a comparatively short conversation that contains enough information to judge the personality of one speaker at a time. Otherwise, annotators would spend a long time reading the whole scene and have difficulty paying attention to its important parts. Moreover, sub-scene extraction yields more annotations than using each scene once, which is beneficial for building a large-scale dataset.
Figure 3.1: An overview of the extraction algorithm. U indicates utterances.

For this algorithm (Fig. 3.1), we first find the set of main speakers in each scene. For each main speaker, we use a sliding-window technique to construct that speaker's frequency distribution graph over the scene. We pick each peak in the graph whose frequency exceeds a threshold (for example, 2). Each peak is then used to identify the index range of consecutive sentences in which the speaker temporarily dominates. Each such set of consecutive sentences extracted from the scene is called a sub-scene. In the example of Fig. 3.1, the two index ranges identified are 6 to 15 and 22 to 30. In those two sub-scenes, ▇▇▇▇▇▇ ▇▇▇▇▇▇ is a main character. After tuning the algorithm to maximize the number of reliable sub-scenes, we generate 8738 sub-scenes from the 10-season Friends transcripts, requiring each sub-scene to contain at least 4 utterances.
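The sliding-window procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the text does not specify the window size, the criterion for a main speaker, or how a peak is mapped to an index range. The function names, the `min_share` and `window` parameters, and the rule of merging consecutive above-threshold windows into one range are all assumptions made for illustration. Only the threshold of 2 and the minimum sub-scene length of 4 come from the text.

```python
from collections import Counter


def main_speakers(speakers, min_share=0.3):
    """Return speakers with at least `min_share` of the utterances.

    The selection criterion is an assumption; the text only says that a
    set of main speakers is found for each scene.
    """
    counts = Counter(speakers)
    total = len(speakers)
    return [s for s, c in counts.items() if c / total >= min_share]


def window_counts(speakers, target, window):
    """Count the target speaker's utterances in every sliding window."""
    n = len(speakers)
    return [speakers[i:i + window].count(target)
            for i in range(max(n - window + 1, 1))]


def extract_sub_scenes(speakers, target, window=5, threshold=2, min_len=4):
    """Return inclusive (start, end) utterance index ranges for `target`.

    Assumed rule: a run of consecutive windows whose count exceeds
    `threshold` surrounds a peak, and the utterances covered by that run
    form one sub-scene. Ranges shorter than `min_len` are dropped.
    """
    counts = window_counts(speakers, target, window)
    ranges = []
    start = None
    # A sentinel below any threshold closes a run that reaches the end.
    for i, c in enumerate(counts + [float("-inf")]):
        if c > threshold and start is None:
            start = i
        elif c <= threshold and start is not None:
            # Last utterance covered by the final window of the run.
            end = min(i - 1 + window, len(speakers)) - 1
            if end - start + 1 >= min_len:
                ranges.append((start, end))
            start = None
    return ranges


# Toy scene: one speaker label per utterance (hypothetical data).
scene = ["A", "B", "A", "A", "A", "B", "C", "B", "C", "B"]
for speaker in main_speakers(scene):
    print(speaker, extract_sub_scenes(scene, speaker, window=4))
```

On this toy scene, speaker A dominates the first three windows, so the sketch returns a single sub-scene for A, while no window for B exceeds the threshold. A stricter peak detector (for example, requiring a local maximum before expanding the range) would be a drop-in replacement for the run-merging step.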
