Network Architecture Overview. In our deep domain-invariant embedding neural network, we deploy five blocks of ResNet-50 [162] as the primary feature extractor and follow the training strategy in [148], which fine-tunes the ImageNet pre-trained model. We construct our backbone network by replacing the Global Average Pooling (GAP) layer and the last 1,000-dim fully connected (FC) layer with two pooling layers and a 2,048-dim FC layer followed by batch normalization (BN) [163] and PReLU [164], as shown in Fig. 5.2. Inspired by CBAM [165], we use both average-pooled and max-pooled features to keep the distinctive object clues gathered by max-pooling. Specifically, we concatenate the outputs of global max pooling and global average pooling and feed them to the next FC layer. The output of this FC layer is a 2,048-dim feature vector, which we call the “domain-invariant embedding” (DIE).

To strengthen the information flow and distill the DIE feature, Recurrent Top-Down Attention (RTDA) is used to recurrently re-weight the channels and spatial positions of the feature maps simultaneously. The RTDA module is implemented with multiple deconvolution and convolution layers, whose details are described in Section 5.3.4. Through T (T = 0, 1, 2, 3...) loops, we employ a 1 × 1 convolution to fuse the output of each loop and obtain the final DIE feature vector.

Subsequently, the DIE features are L2-normalized and fed into the centering constrained cross-domain triplet loss (CCCDTL) function. At the same time, the DIE features are forwarded to both the identity classifier (IC) and domain classifier (DC) modules. The IC module consists of an FC layer and a Dropout layer [166]. It is a general multi-class classifier trained with the standard cross-entropy loss function, formulated as

$$L_{IC}(I_s) = -\frac{1}{|I_s|} \sum_{I \in I_s} \Big( y_i \log P(I) + (1 - y_i) \log\big(1 - P(I)\big) \Big), \quad \text{with } I_s \cup I_t = I,$$

where $I$ represents the images in a training mini-batch, $I_s$ denotes images from the source (labeled) domain, and $I_t$ represents images from the target (unlabeled) domain. P(I)
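
The pooling, normalization, and identity-loss steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`global_pool_concat`, `l2_normalize`, `ic_loss`), the max-then-average concatenation order, the clamping constant, and the averaging over the number of source images are all assumptions made for illustration. The FC, BN, PReLU, and RTDA stages are omitted.

```python
import math


def global_pool_concat(feature_map):
    """Concatenate global max-pooled and global average-pooled features.

    feature_map: list of C channels, each an H x W nested list of floats.
    Returns a list of length 2*C laid out as [max_1..max_C, avg_1..avg_C].
    The max-first ordering is an assumption; the source only says the two
    pooled outputs are concatenated.
    """
    maxes, means = [], []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        maxes.append(max(values))
        means.append(sum(values) / len(values))
    return maxes + means


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero norm)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def ic_loss(probs, labels, is_source, eps=1e-12):
    """Cross-entropy in the binary form of the IC loss, over source images only.

    probs:     P(I) for each image in the mini-batch.
    labels:    y_i for each image (ignored for target images).
    is_source: True for images in I_s, False for images in I_t.
    Averaging over the number of source images is an assumption.
    """
    terms = []
    for p, y, src in zip(probs, labels, is_source):
        if not src:
            continue  # target-domain images are unlabeled and contribute nothing
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        terms.append(y * math.log(p) + (1 - y) * math.log(1 - p))
    if not terms:
        raise ValueError("mini-batch contains no source-domain images")
    return -sum(terms) / len(terms)


# Toy 2-channel, 2x2 feature map standing in for the ResNet-50 output.
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, -1.0], [5.0, 0.0]]]
pooled = global_pool_concat(fmap)   # 2 channels -> 4 pooled values
embedding = l2_normalize(pooled)    # unit-length vector for the triplet loss
loss = ic_loss([0.5, 0.9], [1, 0], [True, False])  # second image is target-domain
```

In a real network the pooled vector would have 4,096 entries (2 × 2,048 channels) before the FC layer maps it back to 2,048 dimensions; the toy feature map here only shows the data flow.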
