Evaluating ML Classifiers
It is necessary to assess the quality of the predictions that an ML model produces. For this purpose, a validation set and a test set containing input samples drawn from the same distribution as the training data are used. During development, the model's performance is assessed by having it produce predictions for the samples in the validation set. If performance falls below the intended or anticipated level, the hyperparameters and configuration are adjusted and the model is re-evaluated on the validation set as part of an ongoing cycle of development and improvement; the model is therefore tuned based on its performance on this data. Once the model's performance on the validation set reaches the desired level, a final assessment is conducted on the completely unseen test set prior to deployment.

This approach is intended to limit overfitting, which occurs when a model is over-tuned to local data and consequently fails to generalize to unseen samples: it performs well during development but inadequately against unseen samples in the wild. With a completely independent test set, the model's performance can be assessed on unseen samples that were not used to tailor its behavior during construction, thereby guarding against overfitting.

To facilitate the examination and comparison of models, metrics that are easy to interpret are typically employed. A core metric is accuracy, the proportion of correct predictions; in the malware detection domain, this is the proportion of predictions that match the true label (e.g., benign or malware). However, it is not appropriate to rely on a single metric alone, as the binary classification problem of malware detection is multifaceted [175]. Since the possible predictions in a binary classification task are either positive or negative, the classifier's predictions can only be correct (true positives (TP) and true negatives (TN)) or erroneous (false positives (FP) and false negatives (FN)). From these counts, the true positive rate (TPR, the proportion of positive samples that were correctly predicted as positive) and the false positive rate (FPR, the proportion of negative samples that were incorrectly predicted as positive) can be derived. Within the malware detection domain in particular, the FPR must remain low [90, 202, 229, 125] lest a system be deployed that incorrectly (and frustratingly) flags legitimate queries and i...
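
As a concrete illustration (and not the tooling used in this work), the following minimal Python sketch uses scikit-learn with a synthetic dataset standing in for a real malware corpus; the RandomForestClassifier, the split proportions, and the n_estimators values are arbitrary placeholders. It shows the validation/test workflow described above and how accuracy, the TPR, and the FPR are derived from the confusion matrix.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labelled feature-vector dataset
# (label 1 = malware, label 0 = benign).
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Hold out a completely unseen test set, then carve a validation set out of
# the remaining data for the tune-and-re-evaluate cycle described above.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

def evaluate(model, X_eval, y_eval):
    """Return accuracy, TPR and FPR for a fitted binary classifier."""
    y_pred = model.predict(X_eval)
    tn, fp, fn, tp = confusion_matrix(y_eval, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)  # proportion of positives correctly predicted as positive
    fpr = fp / (fp + tn)  # proportion of negatives incorrectly predicted as positive
    return accuracy_score(y_eval, y_pred), tpr, fpr

# Development cycle: adjust a hyperparameter, retrain, re-check on the validation set.
best_model, best_acc = None, -1.0
for n_estimators in (50, 100, 200):
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    acc, tpr, fpr = evaluate(model, X_val, y_val)
    print(f"val (n_estimators={n_estimators}): acc={acc:.3f} TPR={tpr:.3f} FPR={fpr:.3f}")
    if acc > best_acc:
        best_model, best_acc = model, acc

# Final, one-off assessment on the unseen test set prior to deployment.
acc, tpr, fpr = evaluate(best_model, X_test, y_test)
print(f"test: acc={acc:.3f} TPR={tpr:.3f} FPR={fpr:.3f}")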
