Feature Vector Clause Samples
Feature Vector. For each resume we extract education and work experience information. The work experience is encoded as a 5-by-1 vector, where each entry corresponds to one type of experience and its value is the duration in years. The education is encoded as a 7-by-1 vector, where each entry corresponds to a degree and takes one of three values: 0 (no degree), 1 (non-science, non-health related degree), or 2 (science or health related degree). With the dataset in this feature-vector form, we train Logistic Regression, Random Forest, Gradient Boosting, and Support Vector Machine models. The accuracy of these models in predicting the CRC level of a resume is reported in Table 5.3.

Table 5.3: Accuracy by using classification models on feature vectors.

Model               Input             Accuracy
Random Forest       Feature vectors   60.89±0.29   61.39   60.89
Gradient Boosting   Feature vectors   61.39        60.40   60.89

For different numbers of estimators of the random forest, as shown in Figure 5.1 (Accuracy vs Number of Estimators), development set accuracy is highest when the number of estimators is around 200. Beyond 200, the training accuracy keeps increasing while the development accuracy decreases, which suggests overfitting.

For different maximum depths of the random forest, as shown in Figure 5.2 (Accuracy vs Maximum Depth), development set accuracy is highest when the maximum depth is around 4. Beyond 4, the training accuracy keeps increasing while the development accuracy decreases, which again suggests overfitting.

For different numbers of estimators of the gradient boosting model, as shown in Figure 5.3 (Accuracy vs Number of Estimators), development set accuracy is highest when the number of estimators is around 100. Beyond 100, the training accuracy keeps increasing while the development accuracy decreases, which suggests overfitting.
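The hyperparameter sweeps described above can be sketched as follows. This is a minimal illustration, not the thesis code: the data here is synthetic (random 12-dimensional vectors standing in for the real resume feature vectors, with three CRC levels), and only the number of estimators of the random forest is swept, mirroring Figure 5.1.

```python
# Hypothetical sketch of the estimator sweep: train a random forest at
# several n_estimators settings and pick the one with the best accuracy
# on a held-out development split. Data is synthetic placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 12)).astype(float)  # stand-in feature vectors
y = rng.integers(0, 3, size=400)                      # stand-in CRC levels

X_train, X_dev, y_train, y_dev = train_test_split(X, y, random_state=0)

best_n, best_acc = None, -1.0
for n in [50, 100, 200, 400]:
    clf = RandomForestClassifier(n_estimators=n, max_depth=4, random_state=0)
    clf.fit(X_train, y_train)
    acc = clf.score(X_dev, y_dev)  # development-set accuracy
    if acc > best_acc:
        best_n, best_acc = n, acc
```

The same loop applies to the other swept hyperparameters (maximum depth, or the learning rate of gradient boosting) by varying the corresponding constructor argument instead of `n_estimators`.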
For different learning rates of the gradient boosting model, as shown in Figure 5.4 (Accuracy vs Learning Rate), development set accuracy is highest when the learning rate is around 0.01. Beyond 0.01, the training accuracy keeps increasing while the development accuracy decreases, which suggests overfitting.

Bags of Words. The bag-of-words representation uses a 100-by-1 vector, with each entry holding a word's TF-IDF score, to represent the resume. With the dataset in this form, the models trained on it are Logistic Random Forest + Bags of Word...
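The 100-dimensional TF-IDF representation described above can be sketched with scikit-learn's `TfidfVectorizer`, capping the vocabulary at the 100 highest-frequency terms. The toy resume strings here are placeholders, not data from the thesis.

```python
# Hedged sketch: represent each resume as a TF-IDF vector with at most
# 100 entries, each entry being one word's TF-IDF score.
from sklearn.feature_extraction.text import TfidfVectorizer

resumes = [  # toy placeholder corpus
    "software engineer with python experience",
    "registered nurse bachelor of science in nursing",
    "project manager with mba and consulting experience",
]
vectorizer = TfidfVectorizer(max_features=100)  # keep top-100 terms
X = vectorizer.fit_transform(resumes)           # sparse (n_resumes, <=100) matrix
```

The resulting rows of `X` are the per-resume vectors that the classifiers are then trained on, in the same way as with the hand-built feature vectors.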
Feature Vector. Based on the previously cleaned vectors for education and work experience, further processing is needed to produce a single feature vector for each resume. We combine the two vectors into one feature vector. The work experience vector is kept as before, with each entry holding the duration in years for the corresponding experience type. For the education vector, each entry is marked 1 if the corresponding degree is mentioned in the resume, 0 if there is no information, and 2 if the degree is mentioned and described as a science or health related degree. After combining the two vectors, each resume is represented by a 12-by-1 vector, with the first five entries carrying the work experience information and the remaining seven the education information. With feature vectors for all resumes, we can apply machine learning using several models that have performed well on multiclass classification: Logistic Regression [5][9], Random Forest [2][16], Gradient Boosting [4], and Support Vector Machine [6][1][10]. More experimental details and results are given in Section 5.2.
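The combination step above can be sketched as follows. The experience-type and degree labels here are assumptions for illustration (the thesis does not list them at this point); only the shapes and the 0/1/2 encoding follow the text.

```python
# Illustrative sketch: combine a 5-entry work experience vector (years
# per experience type) with a 7-entry education vector (0 = no degree,
# 1 = non-science/health degree, 2 = science/health degree) into one
# 12-entry feature vector per resume. Labels below are assumed.
import numpy as np

EXPERIENCE_TYPES = ["clinical", "research", "admin", "teaching", "other"]  # assumed
DEGREES = ["HS", "Associate", "Bachelor", "Master", "PhD", "MD", "Other"]  # assumed

def build_feature_vector(years_by_type, degree_codes):
    """years_by_type: dict experience type -> years of experience.
    degree_codes: dict degree -> 0, 1, or 2 per the encoding above."""
    work = np.array([years_by_type.get(t, 0.0) for t in EXPERIENCE_TYPES])
    edu = np.array([float(degree_codes.get(d, 0)) for d in DEGREES])
    return np.concatenate([work, edu])  # 12-entry feature vector

vec = build_feature_vector({"clinical": 3.0}, {"Bachelor": 2})
```

Stacking these vectors row-wise over all resumes yields the design matrix fed to the four classifiers.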
