Regularization. In our pre-trained experiments with SDA, we did not apply extra regularization during fine-tuning, because the pre-training itself acts as a regularizer. For the DNN experiments we used L2 regularization and dropout. We applied dropout to both the input and hidden layers, as adding dropout to the input layer has reduced error rates in some studies [18]. We used dropout rates of 10% and 20% for the input layer, and 40% and 50% for the hidden layers, following ▇▇▇▇▇▇▇▇▇▇ et al. [40]. In addition, we used a factor of 0.0001 for L2 weight decay regularization, which adds a term to the cost function that penalizes large weights.

Stop criterion. We stopped training after 200 epochs for ▇▇▇▇ pre-trained by SDA, and after 300 epochs for ▇▇▇▇ without pre-training. Alternatively, we stopped training earlier if, within 10 epochs of a new low in validation error, no further low below the current low multiplied by a threshold (0.995) was reached. This lets training continue after a new low in search of another one, while limiting that search to prevent overfitting.

Cost function. For SDA pre-training, we used the squared error. For 𝑘 training examples it is calculated as

C(\theta) = \sum_{i=1}^{k} \left( r_\theta(\mathbf{x}_i) - \mathbf{y}_i \right)^2 , (4)

where 𝜃 represents the parameters (the weights of the neural network) and 𝑟𝜃 represents the reconstruction vector (computed using 𝜃). For the DNN, the negative log-likelihood was minimized:

C(\theta) = - \sum_{i=1}^{k} \log P(\mathbf{y}_i \mid \mathbf{x}_i, \theta) . (5)
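The stop criterion described above can be sketched in a few lines of Python. This is only one possible reading of the rule, not the authors' implementation: it assumes that a validation error counts as a new low only when it falls below the current low multiplied by the threshold, and that the 10-epoch window restarts at each such low. The function name `stopping_epoch` and its parameters are hypothetical, introduced here for illustration.

```python
from typing import Iterable, Tuple


def stopping_epoch(
    val_errors: Iterable[float],
    max_epochs: int = 300,      # 300 without pre-training, 200 with SDA pre-training
    patience: int = 10,         # epochs allowed after the last new low
    threshold: float = 0.995,   # a new low must be below best * threshold
) -> Tuple[int, float]:
    """Return (epoch at which training stops, lowest accepted validation error).

    Assumption: only errors below best * threshold reset the patience window.
    """
    best = float("inf")
    best_epoch = 0
    epoch = 0
    for epoch, err in enumerate(val_errors, start=1):
        # Accept a new low only if it beats the current low by the threshold.
        if err < best * threshold:
            best = err
            best_epoch = epoch
        # Stop at the epoch cap, or when the patience window has run out.
        if epoch >= max_epochs or epoch - best_epoch >= patience:
            break
    return epoch, best


if __name__ == "__main__":
    # Validation error plateaus just above 0.40 * 0.995 after epoch 2.
    errors = [0.50, 0.40] + [0.399] * 20
    print(stopping_epoch(errors))
```

In the example run, 0.399 is lower than 0.40 but not lower than 0.40 × 0.995 = 0.398, so it does not count as a new low under this reading and training stops 10 epochs after epoch 2.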
Appears in 1 contract
Sources: End User Agreement