Mask Training Clause Samples
Mask Training. As described in Section 3.2.2 of the main paper, we realize mask training via binarization in the forward pass and gradient estimation in the backward pass. Following [42, 32], we adopt a magnitude-based strategy to initialize the real-valued masks. Specifically, we consider two variants. The first one (hard variant) identifies the weights in matrix W with the smallest magnitudes, sets the corresponding elements of m̂ to zero, and sets the remaining elements to a fixed value: m̂_{i,j} = 0 if W_{i,j} ∈ Mins(abs(W)), and α × ϕ otherwise.
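A minimal NumPy sketch of this hard-variant initialization. The function name, the `sparsity` fraction (how large the Mins set is), and the concrete values of α and ϕ are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def init_mask_hard(W, sparsity, phi, alpha):
    """Hard-variant magnitude initialization (a sketch): the fraction
    `sparsity` of weights with the smallest magnitudes get m_hat = 0
    (pruned after binarization, since 0 < phi), and all remaining
    entries get the fixed value alpha * phi (kept, assuming alpha >= 1)."""
    k = int(round(sparsity * W.size))
    m_hat = np.full(W.shape, alpha * phi)
    if k > 0:
        # Indices of the k smallest-magnitude weights, i.e. the Mins(abs(W)) set.
        idx = np.unravel_index(np.argsort(np.abs(W), axis=None)[:k], W.shape)
        m_hat[idx] = 0.0
    return m_hat

# Hypothetical toy weights: the two smallest-magnitude entries are zeroed out.
W = np.array([[0.05, -1.2], [0.7, -0.01]])
m_hat = init_mask_hard(W, sparsity=0.5, phi=0.5, alpha=1.0)
```

Initializing the kept entries at α × ϕ (rather than exactly ϕ) leaves them a margin above the binarization threshold, so early gradient updates do not immediately flip them.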
Mask Training. Mask training treats the pruning mask m as trainable parameters. Following [35, 66, 42, 32], we achieve this through binarization in the forward pass and gradient estimation in the backward pass. Each weight matrix W ∈ R^{d1×d2}, which is frozen during mask training, is associated with a binary mask m ∈ {0, 1}^{d1×d2} and a real-valued mask m̂ ∈ R^{d1×d2}. In the forward pass, W is replaced with m ⊙ W, where m is derived from m̂ through binarization: m_{i,j} = 1 if m̂_{i,j} ≥ ϕ, and 0 otherwise
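The forward pass above can be sketched in a few lines of NumPy. The values of W, m̂, and the threshold ϕ = 0.5 below are hypothetical, chosen only to make the binarization visible:

```python
import numpy as np

def binarize(m_hat, phi):
    """Forward-pass binarization: m_{i,j} = 1 if m_hat_{i,j} >= phi, else 0."""
    return (m_hat >= phi).astype(m_hat.dtype)

# Toy example: W stays frozen; only the real-valued mask m_hat is learned.
W = np.array([[0.3, -1.2], [0.7, 0.1]])
m_hat = np.array([[0.6, 0.2], [0.9, 0.4]])

m = binarize(m_hat, phi=0.5)
W_masked = m * W  # element-wise product m ⊙ W, used in place of W
```

Entries of m̂ below the threshold produce zeros in m, so the corresponding weights contribute nothing to the forward computation even though they remain stored in W.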
(1)
where ϕ is the threshold. In the backward pass, since the binarization operation is not differentiable, we use the straight-through estimator [3] to compute the gradients for m̂ from the gradients of m, i.e., ∂L/∂m̂ = ∂L/∂m, where L is the loss. Then, m̂ is updated as m̂ ← m̂ − η ∂L/∂m̂, where η is the learning rate. Following [42, 32], we initialize the real-valued masks according to the magnitudes of the original weights. The complete mask training algorithm is summarized in Appendix A.1.2.
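The straight-through update reduces to copying the gradient of the binary mask onto the real-valued mask and taking an ordinary SGD step. A minimal sketch, with a hypothetical gradient vector standing in for the ∂L/∂m that backpropagation would supply:

```python
import numpy as np

def ste_update(m_hat, grad_m, lr):
    """Straight-through estimator step: treat dL/dm as dL/dm_hat
    (the binarization is skipped in the backward pass), then apply
    the SGD update m_hat <- m_hat - lr * dL/dm_hat."""
    return m_hat - lr * grad_m

# Toy check: a positive gradient pushes a mask entry down (toward pruning),
# a negative gradient pushes it up (toward keeping the weight).
m_hat = np.array([0.8, 0.3])
grad_m = np.array([2.0, -1.0])  # stand-in for dL/dm from backprop
m_hat_new = ste_update(m_hat, grad_m, lr=0.1)
```

Because the update acts on m̂ while the forward pass only sees the binarized m, a weight's mask can drift gradually across the threshold ϕ before its binary state actually flips.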
