Review of Attention (Vision Models) - Part 2

Attention in Vision Models:

In the previous article, we briefly discussed the concept of attention and its application in the language domain. In this article, we will discuss some of the works that have applied attention to vision tasks.

Although vision models have traditionally used Convolutional Networks (ConvNets) as the standard image encoders, several works have tried using attention mechanisms at various stages of a vision model. Most of these works apply attention alongside traditional convolution to improve the performance of a model on a particular task. Some works have gone further and tried to replace convolution with purely attention-based techniques. In this article, we will review models from both approaches.

One common technique in vision is to apply self-attention to the image features generated by a traditional ConvNet and use the output of the attention block in the downstream task.

Show, Attend, and Tell:

Xu et al. 2015 used attention for the task of image captioning [1]. The model used a ConvNet to extract image features, which were fed to an attention-based RNN that generated the text description.

The paper introduced two methods of performing attention: soft and hard. The difference lies primarily in how the model operates on the patches of the input image. In hard attention, the model operates on a single patch at a time, sampled from the attention distribution. In soft attention, the model attends to all the patches at once through a smooth weighted average, which keeps the operation differentiable. Although hard attention is non-differentiable, the paper adopted more complex techniques, such as variance reduction and reinforcement learning, to learn its parameters.
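The contrast between the two variants can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the patch features, the scores, and the function names are all hypothetical, and the training procedure for hard attention is left out.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def soft_attention(features, scores):
    """Blend all patch features with smooth weights (differentiable)."""
    weights = softmax(scores)             # shape: (num_patches,)
    return weights @ features, weights    # context shape: (feature_dim,)

def hard_attention(features, scores, rng):
    """Sample a single patch from the attention distribution."""
    weights = softmax(scores)
    index = rng.choice(len(weights), p=weights)
    return features[index], index

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))        # 4 hypothetical patches, 3-dim features
scores = np.array([2.0, 0.5, 0.1, -1.0])  # hypothetical relevance scores

soft_context, weights = soft_attention(features, scores)
hard_context, picked = hard_attention(features, scores, rng)
```

The soft context is a mixture of every patch, while the hard context is exactly one row of `features`; the sampling step is what breaks differentiability.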

Another useful aspect of applying attention is the ability to visualize the attention weights. It can be used to visualize the regions that the model focused on while generating each word in the caption.

Fig. 1. A giraffe standing in a field with trees.[1]

Self-Attention Generative Adversarial Networks (SAGAN) [2]:

GANs have traditionally proved to be great tools for image synthesis. Convolutional GANs, however, limit the receptive field to spatially local points in low-resolution feature maps. Zhang et al. 2019 addressed this limitation by introducing self-attention, which lets the model attend to all the feature locations and use them for image synthesis [2].

Three transformations are applied to the image features from the hidden layer. These can be thought of as the Query, Key, and Value matrices. Applying softmax to the dot product between the Query and Key matrices results in an attention map. The attention map indicates the extent to which each position in the Query matrix should attend to each position in the Key matrix. The attention map is multiplied with the Value matrix, and a 1 x 1 convolution is applied to the result to obtain the self-attention feature maps.

Fig. 2. Self-Attention module in SAGAN [2]

The self-attention feature map o_i is multiplied by a learnable scalar γ and added to the original feature map x_i from the hidden layer to generate the final output y_i.

y_i = γ o_i + x_i
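The steps of the module can be sketched as follows. This is a simplified NumPy version under stated assumptions, not the authors' code: a 1 x 1 convolution applies the same linear map at every position, so it is written here as a matrix product on the flattened feature map, and the weight names, shapes, and channel reduction are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_block(x, w_query, w_key, w_value, w_out, gamma):
    """x has shape (height, width, channels); returns the output and attention map."""
    h, w, c = x.shape
    flat = x.reshape(h * w, c)            # one row per spatial position
    query = flat @ w_query                # (N, reduced)
    key = flat @ w_key                    # (N, reduced)
    value = flat @ w_value                # (N, reduced)
    attention_map = softmax(query @ key.T, axis=-1)  # (N, N), each row sums to 1
    o = (attention_map @ value) @ w_out   # final 1 x 1 convolution, back to c channels
    y = gamma * o + flat                  # y_i = gamma * o_i + x_i
    return y.reshape(h, w, c), attention_map

rng = np.random.default_rng(0)
channels, reduced = 8, 2                  # hypothetical sizes
x = rng.normal(size=(4, 4, channels))
w_query = rng.normal(size=(channels, reduced))
w_key = rng.normal(size=(channels, reduced))
w_value = rng.normal(size=(channels, reduced))
w_out = rng.normal(size=(reduced, channels))

y, attention_map = self_attention_block(x, w_query, w_key, w_value, w_out, gamma=0.5)
```

With `gamma=0.0` the block returns its input unchanged, so the scalar controls how much of the non-local signal is mixed into the local features.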

SAGAN efficiently captures long-range dependencies and performs especially well in class-conditional image generation. It significantly improved the state-of-the-art Inception Score on this task at the time of its publication.

Self-Attention as an alternative to convolution:

Inspired by the success of the transformer architecture in the text domain, several papers explored using attention as an alternative to convolution in vision models [3] [4] [5]. Since attention has proven to be a very useful technique for learning from sequential data, most of these models treated images as a sequence of pixels and operated on them.

There are a few advantages to using self-attention instead of convolution. ConvNets operate on pixels in a small neighbourhood (the kernel size) and efficiently learn local correlation structures. They are therefore limited by spatial proximity to the pixel. Attention, in contrast, is also effective at learning long-range dependencies between distant positions in an image [3].

The downside is that the computational complexity of a self-attention layer increases drastically with the size of the image. Parmar et al. 2018 calculated the time complexity of a self-attention layer attending to l_m positions as O(h · w · l_m · d). It is computationally feasible for the model to attend to all 192 positions in an 8 x 8 image, but this could not be scaled up to the 3072 positions in a 32 x 32 image.
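The position counts above can be reproduced with a little arithmetic. The sketch below assumes three colour channels per pixel, which is what the figures 192 and 3072 imply, and counts query/memory pairs for full (non-local) attention; the helper name is hypothetical.

```python
def num_positions(height, width, channels=3):
    """Each channel of each pixel counts as one position (assumes RGB)."""
    return height * width * channels

small = num_positions(8, 8)        # 192
large = num_positions(32, 32)      # 3072

# With full attention, every position attends to every position,
# so the number of query/memory pairs grows quadratically.
pairs_small = small ** 2           # 36864
pairs_large = large ** 2           # 9437184
growth = pairs_large // pairs_small  # 256
```

A 16-fold increase in positions therefore means a 256-fold increase in attention pairs, which is why restricting each query to a fixed number of positions matters.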

Additionally, the vision models need to have the correct inductive biases to enable them to efficiently learn image features.

Translation equivariance:

The model should be resilient to minor perturbations in the pixel distribution. If an object is shifted by a few pixels in an image, the model shouldn’t interpret it as a completely new image; instead, it should be able to recognize similar global contexts present in the two images.
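Convolution has this property built in, which a small numerical check can make concrete. The sketch below is illustrative only: it uses wrap-around (circular) padding so that the equality holds exactly at the borders, and the function name, image, and kernel are hypothetical.

```python
import numpy as np

def circular_conv2d(image, kernel):
    """Cross-correlate a 2-D image with a kernel using wrap-around padding."""
    out = np.zeros_like(image, dtype=float)
    kernel_h, kernel_w = kernel.shape
    for i in range(kernel_h):
        for j in range(kernel_w):
            # Adds kernel[i, j] * image[(r + i) % H, (c + j) % W] to out[r, c].
            out += kernel[i, j] * np.roll(image, shift=(-i, -j), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))

shift = (1, 2)  # move the content down by 1 pixel and right by 2 pixels
shifted_then_filtered = circular_conv2d(np.roll(image, shift, axis=(0, 1)), kernel)
filtered_then_shifted = np.roll(circular_conv2d(image, kernel), shift, axis=(0, 1))

same = np.allclose(shifted_then_filtered, filtered_then_shifted)
```

Shifting the input and then filtering gives the same result as filtering and then shifting, so the features move with the object instead of changing. A plain attention layer over pixel positions has no such guarantee unless it is designed in.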

The relative position of the pixels:

The context of the image is dependent on the relative position of the pixels in the image.

The model should be able to encode the relative positions of the pixels in the image and utilize this information to generate the output signal.

We’ll explore a few works that deal with these challenges.

Image Transformer:

Parmar et al. 2018 developed a purely attention-based transformer architecture for image generation. The images were formulated as a sequence of pixels, and the model was trained on a sequence completion objective, i.e., to generate the next pixel in the image, conditioned on the previous set of pixels [3].

The self-attention layer computed a d-dimensional representation for each position, i.e., each channel of each pixel. The representation for a given position is calculated as a weighted sum of contributions from previous positions. The weights are determined by the attention distribution over previous positions. Instead of attending to all the previous inputs, the self-attention layer attends only to a fixed number of positions in the local neighbourhood, like a ConvNet. This addresses the computational challenge of applying attention to images.

One advantage of attention is parallel processing: attention need not be computed separately for each pixel. Instead, the image is split into contiguous query blocks, and each query block is associated with a larger memory block that also contains the query block. For all queries in a given query block, the model attends to the same memory matrix.

Parmar et al. proposed two schemes for deciding the pattern of query/memory blocks: 1-D local attention, which splits the flattened image into non-overlapping query blocks of fixed length, and 2-D local attention, which uses non-overlapping rectangular query blocks.

Fig. 3. Local 1-D vs 2-D attention [3].

In both types of attention, each position in the query block attends to all the pixels in the memory block. In Fig. 3, the pixel marked as q is the pixel last generated at that time step. The positions marked in white within the query/memory blocks are masked out, and they don't contribute to the next representation of positions in the query block.
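The 1-D scheme can be sketched in NumPy as below. This is a simplified illustration under several assumptions rather than the paper's implementation: the input itself serves as queries, keys, and values (no learned projections), the memory block is the query block plus a fixed number of preceding positions, and a position may attend to itself and earlier positions. The function and parameter names are hypothetical.

```python
import numpy as np

def local_1d_attention(x, query_len, memory_extra):
    """x: (num_positions, d) flattened image; returns an array of the same shape."""
    n, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, n, query_len):               # non-overlapping query blocks
        end = min(start + query_len, n)
        q_idx = np.arange(start, end)
        m_idx = np.arange(max(0, start - memory_extra), end)  # shared memory block
        scores = x[q_idx] @ x[m_idx].T / np.sqrt(d)     # (query_len, memory_len)
        visible = m_idx[None, :] <= q_idx[:, None]      # mask out later positions
        scores = np.where(visible, scores, -np.inf)
        scores = scores - scores.max(axis=1, keepdims=True)
        weights = np.exp(scores)                        # masked entries become 0
        weights = weights / weights.sum(axis=1, keepdims=True)
        out[q_idx] = weights @ x[m_idx]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 4))        # 12 hypothetical positions, 4-dim representations
out = local_1d_attention(x, query_len=4, memory_extra=4)
```

All queries in a block are scored against the same memory matrix in a single matrix product, which is what makes the block formulation parallel-friendly; the mask plays the role of the white positions in Fig. 3.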

In this article, we reviewed some of the approaches to applying attention in vision models. We will continue this discussion in the next article and review a few additional approaches in this domain.
