Principal Component Analysis
Principal Component Analysis. At first, the goal of using PCA was to inspect the covariance matrix it generates in order to gain insight into which features are the most important to train the model on. This proved difficult to decipher, however, as each dimension in a PCA-transformed data set depends on multiple features at once. Instead, the built-in attribute from the Scikit-learn library for estimating the coefficients of each feature was used; for those results, refer to section 6.

While experimenting with applying PCA to the model, another use for PCA was discovered. Because it reduces the number of dimensions while combining multiple features into one dimension, it allows correlated features to merge. Using this method to reduce noise before the data is actually trained could be useful.

Figure 3: The F1 scores per number of PCA dimensions used, tested in 10-fold cross-validation.

Downsampling results in the graph shown in figure 3, where the x-axis represents the number of dimensions in the PCA-transformed data. Only logistic regression is shown in this case, as PCA calls for normalized data, which does not work well with a Bernoulli naive Bayes algorithm: that algorithm interprets any value greater than 0 as 1 and any value below that as 0. The figure shows that reducing the number of feature dimensions by 2 still yields roughly the same F1 score. In fact, at 13 dimensions, the average F1 score over 10-fold cross-validation rises from 61.25 to 61.51. Unfortunately, this increase again proves to be insignificant (p > 0.05). Nevertheless, the graph suggests that the model could work with fewer features, and that some features used earlier might be either insignificant or could be combined with other existing features in the model.
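The experiment above can be sketched in Scikit-learn as a sweep over the number of PCA components, scoring each setting with the mean F1 over 10-fold cross-validation. This is a minimal sketch, not the report's actual code: the real dataset is not available here, so a synthetic binary-classification set with 15 features is assumed (15 chosen only to match the "reduced by 2 gives 13 dimensions" observation), and a `coef_` inspection stands in for the coefficient estimates the text defers to section 6.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the report's dataset (15 features assumed).
X, y = make_classification(n_samples=500, n_features=15, random_state=0)

# Sweep the number of PCA dimensions; each setting is scored with the
# mean F1 over 10-fold cross-validation, as plotted in figure 3.
for n_components in range(2, 16):
    model = make_pipeline(
        StandardScaler(),                # PCA calls for normalized data
        PCA(n_components=n_components),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, X, y, cv=10, scoring="f1")
    print(f"{n_components:2d} dimensions: mean F1 = {scores.mean():.4f}")

# The per-feature coefficient estimates mentioned in the text correspond
# to the fitted model's `coef_` attribute (taken here without PCA, so
# each coefficient maps back to a single original feature).
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
coefs = clf[-1].coef_.ravel()
print(coefs)
```

Wrapping the scaler, PCA, and classifier in one pipeline keeps the normalization and projection inside each cross-validation fold, so no information from the held-out fold leaks into the fit.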
