Experiments and Evaluation Clause Samples
Experiments and Evaluation. The unsupervised models described above were all evaluated on the self-annotated Reddit College test set. The evaluation metrics used for the unsupervised approaches were consistent with the three metrics for this dataset described in Section 4.2.3: precision, recall, and F1 score.

Figure 4.5: Pipeline of unsupervised two-label approach

According to the evaluation metrics for single-label approaches shown in Table 4.7, the best-performing model was the 32To8 approach with ▇▇▇▇▇▇▇-base, which produced the highest precision, recall, and F1 score. Both the 32To8 and Merged-8 approaches with ▇▇▇▇▇▇▇-large performed worse than their ▇▇▇▇▇▇▇-base counterparts.

Model          Approach   Precision  Recall  F1 Score
▇▇▇▇▇▇▇-base   32To8      0.735      0.492   0.589
▇▇▇▇▇▇▇-base   Merged-8   0.681      0.456   0.546
▇▇▇▇▇▇▇-large  32To8      0.708      0.469   0.565
▇▇▇▇▇▇▇-large  Merged-8   0.672      0.449   0.538

Table 4.7: Evaluation of single-label unsupervised models on the self-annotated Reddit College test set

For the two-label approaches, two experiments were performed to compare and select the best model. Experiment 1, as described in Section 4.3.2, chose the merged label as the first output emotion label and the original top-1 prediction from the Transformer baseline models as the second output for ambiguous input utterances. According to the results shown in Table 4.8, the 32To8 approach with ▇▇▇▇▇▇▇-base outperformed all other models, with the highest precision, recall, and F1 scores. Overall, the 32To8 approach achieved higher F1 scores than the Merged-8 approach for both the ▇▇▇▇▇▇▇-base and ▇▇▇▇▇▇▇-large models.
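The Experiment 1 output rule can be sketched as follows. This is an illustrative reconstruction rather than the thesis implementation: the `MERGE_MAP` entries, label names, and the ambiguity flag are hypothetical stand-ins for the actual 32-to-8 merge mapping and ambiguity detection.

```python
# Sketch of the Experiment 1 two-label scheme: for an ambiguous utterance,
# emit the merged label first and the baseline model's original top-1
# prediction second. MERGE_MAP below is a hypothetical fragment of the
# 32-to-8 mapping, not the thesis's actual mapping.
MERGE_MAP = {
    "furious": "anger",
    "annoyed": "anger",
    "joyful": "joy",
}

def experiment1_labels(top1_label: str, is_ambiguous: bool) -> list[str]:
    """Return one or two output emotion labels for an utterance."""
    merged = MERGE_MAP.get(top1_label, top1_label)
    if is_ambiguous and merged != top1_label:
        # Merged label first, original top-1 prediction second.
        return [merged, top1_label]
    return [merged]

print(experiment1_labels("furious", is_ambiguous=True))   # ['anger', 'furious']
print(experiment1_labels("joyful", is_ambiguous=False))   # ['joy']
```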
Model          Approach   Precision  Recall  F1 Score
▇▇▇▇▇▇▇-base   32To8      0.643      0.602   0.622
▇▇▇▇▇▇▇-base   Merged-8   0.568      0.534   0.550
▇▇▇▇▇▇▇-large  32To8      0.697      0.528   0.601
▇▇▇▇▇▇▇-large  Merged-8   0.564      0.528   0.545

Table 4.8: Evaluation of two-label Experiment 1 models

Experiment 2, as described in Section 4.3.2, chose the original top-1 prediction from the Transformer baseline models as the first output and the original top-2 prediction, when within a certain probability-difference threshold, as the second output. According to the results shown in Table 4.9, the 32To8 approach with ▇▇▇▇▇▇▇-base and ▇▇▇▇▇▇▇-large performed roughly the same, with the highest F1 scores; the difference is that the 32To8 ▇▇▇▇▇▇▇-base model had a higher precision score than the ▇▇▇▇▇▇▇-large model. Overall, the 32To8 approach achieved higher precision, recall, and F1 scores than the Merged-8 approach for both the ▇▇▇▇▇▇▇-base and ▇▇▇▇▇▇▇-large models in this experiment.

Model Approach Precision Recall F1 Score ▇▇▇▇...
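The Experiment 2 rule (emit the top-1 prediction always, and add the top-2 prediction only when its probability is close enough to the top-1) can be sketched as follows. The threshold value of 0.1 and the label set are hypothetical placeholders, since the section does not state the actual threshold.

```python
def experiment2_labels(probs, labels, diff_threshold=0.1):
    """Emit the top-1 label always; add the top-2 label when the probability
    gap between top-1 and top-2 falls below diff_threshold.
    The 0.1 default is an illustrative placeholder, not the thesis value."""
    # Rank (probability, label) pairs from most to least probable.
    ranked = sorted(zip(probs, labels), reverse=True)
    (p1, l1), (p2, l2) = ranked[0], ranked[1]
    out = [l1]
    if p1 - p2 < diff_threshold:
        out.append(l2)
    return out

# A close top-2 triggers a second label; a confident top-1 does not.
print(experiment2_labels([0.45, 0.40, 0.15], ["joy", "surprise", "anger"]))  # ['joy', 'surprise']
print(experiment2_labels([0.70, 0.20, 0.10], ["joy", "surprise", "anger"]))  # ['joy']
```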
Experiments and Evaluation. Both the 32To8 single-label approach and the Merged-8 single-label approach were tested with the ▇▇▇▇, ▇▇▇▇▇▇▇-base, and ▇▇▇▇▇▇▇-large models on the ED-32, ED-8, and self-annotated Reddit College datasets. The evaluation metrics used here varied according to the test dataset. When the test data was from ED-32 or ED-8, model accuracy was calculated as the number of true predictions divided by the total number of predictions, since these datasets contain only one true label for each utterance, which means that the number of predictions equals the number of true labels. When the test data was from the self-annotated Reddit College dataset, three metrics were calculated: 1) Precision, the number of true predictions divided by the total number of predictions; 2) Recall, the number of true predictions divided by the total number of true labels; 3) F1 Score, the harmonic mean of precision and recall (two times the product of precision and recall divided by their sum).

For each of the approaches (the 32To8 and Merged-8 single-label approaches described above), model accuracies with ▇▇▇▇, ▇▇▇▇▇▇▇-base, and ▇▇▇▇▇▇▇-large for detecting emotions on the corresponding test set and number of emotions were measured and compared. Among the 32To8 single-label classifiers, accuracy increased for all Transformer models after the 32 emotions were merged into 8 labels (see Table 4.1), which means the merging process was effective in detecting emotion more accurately on the Empathetic Dialogues dataset. The accuracies of the 32To8 models on ED-8 were also slightly higher than those produced by the Merged-8 models, meaning that classifying utterances with 32 emotions and then merging them into 8 emotions was more effective than directly classifying utterances with 8 emotions. Overall, the model with the highest accuracy was the 32To8 single-label approach with ▇▇▇▇▇▇▇-large, which reached an accuracy of 0.819 on ED-8.
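The three metrics defined above follow directly from the prediction counts; a minimal sketch, using made-up counts purely for illustration:

```python
def precision_recall_f1(n_true_preds: int, n_preds: int, n_true_labels: int):
    """Precision, recall, and F1 exactly as defined in the text:
    precision = true predictions / all predictions made,
    recall    = true predictions / all true labels,
    F1        = harmonic mean of precision and recall."""
    precision = n_true_preds / n_preds
    recall = n_true_preds / n_true_labels
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 30 correct predictions out of 40 made, against 50 true labels.
p, r, f = precision_recall_f1(30, 40, 50)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.75 0.6 0.667
```

Note that on ED-32 and ED-8, where each utterance has exactly one true label, the number of predictions equals the number of true labels, so precision, recall, and accuracy all coincide.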
Model          Approach   Dataset  Accuracy
▇▇▇▇           32To8      ED-32    0.575
▇▇▇▇           32To8      ED-8     0.770
▇▇▇▇           Merged-8   ED-8     0.762
▇▇▇▇▇▇▇-base   32To8      ED-32    0.604
▇▇▇▇▇▇▇-base   32To8      ED-8     0.808
▇▇▇▇▇▇▇-base   Merged-8   ED-8     0.801
▇▇▇▇▇▇▇-large  32To8      ED-32    0.627
▇▇▇▇▇▇▇-large  32To8      ED-8     0.819
▇▇▇▇▇▇▇-large  Merged-8   ED-8     0.805

Table 4.1: Accuracy of single-label baseline models on Empathetic Dialogues

These models were then tested on the self-annotated Reddit College test set described in Section 3.3.3. For ▇▇▇▇-based models, the performance of the 32To8 approach was worse than that of the Merged-8 approach. However, for ▇▇▇▇▇▇▇-base and ...
