Machine Learning Tasks Sample Clauses

Machine Learning Tasks. Machine learning (ML), in the broadest sense, is the approach by which an algorithm learns to solve a particular task from a dataset, without being given explicit instructions on how to do so. In numerous fields, it has achieved exceptional success [95, 57, 162]. ML has also played an increasing role in cybersecurity, for example in finding spam tweets [149], threat hunting [49], and network intrusion detection [217]. It has also been widely applied to malware detection, which is the focus of our work. In this setting, an ML model predicts whether a particular file or executable is benign or malicious [82].
Typically, machine learning tasks are either supervised or unsupervised [33]. In a supervised learning setting, the ML algorithm is provided with a set of input samples and their corresponding labels, which together form the training data [114, 206]. The ML algorithm learns the association between input samples and labels from the training data in order to infer labels for unseen input samples in the future. It does so by recognizing patterns and correlations in the individual properties and characteristics of the data, known as features. Supervised learning tasks include classification, which produces categorical predictions (e.g., whether a file is benign or malicious), and regression, which produces numerical predictions (e.g., the likelihood that an attack is in progress). In contrast, in an unsupervised learning setting, no labels are provided, and an algorithm can instead cluster input samples according to some notion of similarity [107], for example to group input samples by malware family [78]. Furthermore, semi-supervised learning has been proposed, where a small amount of labeled data is combined with a large amount of unlabeled data [237]. This is useful when acquiring class labels is challenging and requires specialized knowledge.
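As a rough illustration of the unsupervised setting described above, the following sketch groups unlabeled feature vectors by Euclidean distance using a small k-means loop. The feature values, the helper name `kmeans`, and the choice of k-means itself are assumptions made for illustration; the text does not prescribe a particular clustering algorithm or notion of similarity.

```python
import math

def kmeans(points, k, iterations=10):
    """Cluster points into k groups without using labels (hypothetical helper)."""
    # Deterministic initialization for the sketch: the first k points are centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Made-up two-dimensional feature vectors; no labels are supplied.
samples = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
clusters = kmeans(samples, k=2)
```

The returned list holds one cluster index per input sample. The indices carry no meaning on their own; interpreting a cluster (for instance, as a malware family) would still require outside knowledge.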
In this dissertation, we focus on supervised learning, and more specifically on classification, as we are interested in the problem of distinguishing benign objects from malicious ones in the context of ML-based malware detection. Under this learning setting, an ML model is constructed using the training data, which consists of input samples and labels. The ML model can then predict whether an unseen input sample belongs to the benign or the malicious class, based on the patterns and correlations it learned during training.
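The train-then-predict workflow described above can be sketched with a deliberately simple nearest-centroid classifier. This is a minimal illustration under stated assumptions, not the model used in this work: the feature vectors, the label strings, and the helper names `train` and `predict` are all hypothetical.

```python
import math

def train(samples, labels):
    """Learn one centroid (mean feature vector) per class from labeled data."""
    model = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        model[label] = [sum(col) / len(members) for col in zip(*members)]
    return model

def predict(model, sample):
    """Return the class whose centroid is closest to the unseen sample."""
    return min(model, key=lambda label: math.dist(sample, model[label]))

# Made-up training data: each input sample is a feature vector with a label.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_y = ["benign", "benign", "malicious", "malicious"]

model = train(train_x, train_y)
prediction = predict(model, (0.85, 0.7))  # an unseen input sample
```

Here the "patterns" the model learns are just per-class averages of the features. Practical malware classifiers use far richer features and models, but the interface is the same: fit on labeled samples, then assign a class to samples not seen during training.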