Transformers, More Than Meets the Eye
How Transformers are Changing the Field of Computer Vision

Thomas Molfetto
6 min read · Jan 5, 2021

By Scott Clark, Senior Research Scientist at Clarifai

Deep learning (DL) has seen an explosion of activity in the last 20 years. Thanks to the internet and the investment of companies like Clarifai and NVIDIA, advances in deep learning are taking place every year, particularly in the fields of computer vision (CV) and natural language processing (NLP). While CV and NLP both make use of deep learning, the two fields have traditionally employed drastically different neural network architectures to perform their various tasks.

There has been some level of cross-pollination, e.g. the use of convolutional neural networks (historically used in computer vision) in NLP for text classification, but by and large, an architecture that holds the state-of-the-art (SOTA) in image classification is unlikely to also hold the SOTA title on an NLP task, and vice versa. However, a recent paper has shown that a neural network module typically associated with NLP, the transformer, can be applied to CV tasks and produce results on par with SOTA convolutional architectures. This is a massive development in the deep learning space, and it has some incredibly far-reaching implications for the future of neural network research.

Since AlexNet was released in 2012, convolutional neural networks (CNNs) have dominated the image processing space in deep learning.

CNNs are particularly useful for dealing with data in which spatial reasoning is key, like an image: a typical image consists of a HEIGHT × WIDTH × CHANNELS array, where each X/Y location, or pixel, encodes the intensity of red, green, and blue (RGB).
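To make that concrete, here is a minimal sketch in NumPy; the 224×224 size is just an illustrative choice, not something from the papers discussed here:

```python
import numpy as np

# A hypothetical 224x224 RGB image: HEIGHT x WIDTH x CHANNELS, intensities in [0, 255].
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(image.shape)       # (224, 224, 3)
r, g, b = image[10, 42]  # the pixel at row 10, column 42 is a red/green/blue triple
```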

CNNs apply a series of learned filters to pixels within a local neighborhood, extracting features that the downstream network layers combine and leverage to perform a specific task, such as image classification or object detection. A common example used to demonstrate the value of CNNs in introductory deep learning courses is MNIST, a simple image classification dataset of handwritten digits. A filter in the first layer of a CNN might be an edge detector, whereas a filter at a later step in the architecture would make use of these detected edges to identify larger structures, like the circles found in the number ‘8’ or the right angle found at the top of the number ‘5’.

For this reason, CNNs are commonly described as hierarchical feature extractors: early layers identify low-level structures within the image (e.g., edges, lines, curves), and deeper layers combine these to identify macro structures (e.g., an eyeball, a car headlight, etc.). The identification of these different structures allows the traditional fully-connected/dense layers at the deepest part of the model to determine which class the image corresponds to.
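As a rough illustration (not any particular published architecture), a tiny PyTorch CNN for MNIST-style digits might look like the following, with two convolutional stages feeding a dense classification head; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """A toy hierarchical feature extractor for 28x28 grayscale digits (MNIST-style)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # Early layer: filters here tend to learn low-level structure (edges, curves).
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 28x28 -> 14x14
            # Deeper layer: combines earlier features into larger structures.
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 14x14 -> 7x7
        )
        # Fully-connected head maps the extracted features to class scores.
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

logits = SmallCNN()(torch.randn(8, 1, 28, 28))  # a batch of 8 fake "digits"
print(logits.shape)                             # torch.Size([8, 10])
```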

While CNNs have been employed in CV since the dawn of modern deep learning, a newer architecture called the transformer has become standard practice for dealing with unstructured text data. Prior to the advent of the transformer, most NLP tasks relied on some form of recurrent neural network (RNN), in which each element of a sequence (typically each word) is processed iteratively according to some kind of sequential ordering.

While RNNs have a solid inductive bias for temporal/sequential data like sentences, each new piece of information (in the NLP case, each word) must be processed serially. This poses a couple of problems: firstly, it prevents parallelization, dramatically increasing the time to process one sample. Secondly, because RNNs process information iteratively, passing only partial information forward in time, there is no truly global operation happening, and information from the beginning of a sentence can be forgotten by the network by the time it reaches the final timestep. In 2017, in the paper Attention Is All You Need, the authors showed that recurrent and convolutional architectures could be done away with by leveraging what they referred to as an attention mechanism. In the text domain, this mechanism builds a matrix for each sentence that captures pairwise relationships between words: each word vector is passed through learned linear transformations, and a scaled dot product between the results determines how strongly each word attends to every other word.
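A minimal, single-head version of that scaled dot-product self-attention might look like this in PyTorch; the weight matrices here are random stand-ins for what would normally be learned parameters, and the function name and dimensions are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # linear transformations of each word vector
    scores = (q @ k.T) / math.sqrt(k.shape[-1])   # pairwise (scaled) dot products
    weights = F.softmax(scores, dim=-1)           # one attention distribution per word
    return weights @ v                            # every output can see the whole sequence

d_model = 64
x = torch.randn(10, d_model)                                       # 10 "word" vectors
w_q, w_k, w_v = [torch.randn(d_model, d_model) for _ in range(3)]  # stand-ins for learned weights
print(self_attention(x, w_q, w_k, w_v).shape)                      # torch.Size([10, 64])
```

Note that every word attends to every other word in a single step, which is what gives the transformer its global view of the sequence and makes it trivially parallelizable.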

Given that the transformer is natively permutation invariant, and that the same words can mean different things in different orders (e.g., “he went to the pet store” versus “he went to store the pet”), each word embedding is combined with a positional embedding so the model can tell word order. By getting rid of recurrent connections in the network, issues traditionally associated with RNNs (vanishing gradients with long sequences, loss of information in the forward pass, lack of parallelization) are mitigated, and training massive models becomes possible.
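In the original paper this positional signal is a fixed sinusoidal encoding added element-wise to the word embeddings (many later models learn the positional embedding instead); a minimal sketch:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """The fixed sine/cosine positional encoding from Attention Is All You Need."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

word_embeddings = torch.randn(6, 512)  # e.g. "he went to the pet store", already embedded
inputs = word_embeddings + sinusoidal_positional_encoding(6, 512)  # position-aware inputs
```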

Transformers have been used for NLP since their inception, but they’ve also been slowly creeping their way into the computer vision space. In May 2020, Facebook AI released “End-to-End Object Detection with Transformers,” a paper that proposes the DETR architecture, in which the box and classification heads typically attached to a convolutional encoder for object detection are replaced with a transformer, allowing for truly anchorless object detection using global attention.

While this result was interesting and a large step forward for the use of transformers in computer vision, the DETR architecture still relied on convolutional layers to encode features and was not state of the art with respect to mean average precision or inference speed. In the even more recent paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, the authors applied a pure transformer to images and showed that it can outperform current SOTA convolutional architectures.

Rather than operating on individual words, the authors split an image into fixed-size patches, flatten each patch into a vector, linearly project it, and add a positional embedding. Similar to Google’s BERT, a learnable classification token is prepended to this sequence of patch embeddings, and its output representation is used to predict the class the image belongs to. The authors built three ViT architectures with varying levels of complexity: a base model, a large model, and a huge model. Each model was pretrained on the JFT-300M dataset, a proprietary dataset owned by Google. These models were tested on a variety of benchmarks against two previous SOTA Google models, BiT (Kolesnikov et al., 2020) and Noisy Student (Xie et al., 2020), both of which were also trained on JFT-300M. All three ViT models outperformed BiT and Noisy Student while using significantly fewer computational resources.
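A simplified sketch of that input pipeline (patchify, flatten, project, prepend a class token, add positional embeddings) might look like the following; the dimensions mirror a base-sized ViT, but the code itself is an illustration under those assumptions, not the authors’ implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches, project them, prepend a class token,
    and add learned positional embeddings (a sketch of ViT's input pipeline)."""
    def __init__(self, image_size=224, patch_size=16, channels=3, d_model=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_size = patch_size
        self.project = nn.Linear(channels * patch_size * patch_size, d_model)
        self.class_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.position = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, c, h, w = images.shape
        p = self.patch_size
        # (B, C, H, W) -> (B, num_patches, C*p*p): each patch flattened into one vector.
        patches = (images.unfold(2, p, p).unfold(3, p, p)
                         .permute(0, 2, 3, 1, 4, 5)
                         .reshape(b, -1, c * p * p))
        tokens = self.project(patches)
        cls = self.class_token.expand(b, -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.position

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]): 196 patches + 1 class token
```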

The paper reports exact results for all three ViT models across tasks (ImageNet, CIFAR, etc.), along with compute measured in TPUv3-core-days (the number of days needed to train multiplied by the number of TPUv3 cores used; for example, training for ten days on 256 cores amounts to 2,560 core-days). Though it’s very encouraging that ViT is able to converge at a lower TPU cost than Noisy Student or BiT, it’s also worth noting that ViT in its current state is still likely prohibitively expensive for the majority of independent researchers, requiring 230–2.5k TPUv3-core-days. The fact that these models were also trained on a dataset exclusive to Google does not do much to bolster their accessibility.

The authors conducted additional experiments using a “hybrid” architecture, in which their patch-based transformer is applied to convolutionally extracted feature maps instead of raw pixels, similar to how DETR leverages the transformer module. They found that in the “low” compute regime, the convolutional encoder bolsters performance significantly, but that as pre-training compute increases, the gap in performance vanishes; this seemingly tracks with transformers’ reputation as extremely data- and compute-hungry beasts.
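As a toy illustration of the hybrid idea (not DETR’s or the paper’s exact setup), a small convolutional stem can produce a feature map whose spatial locations are then treated as transformer tokens; every layer size here is an arbitrary choice:

```python
import torch
import torch.nn as nn

# Hypothetical "hybrid" front end: a convolutional stem extracts a feature map,
# and its spatial positions (rather than raw pixel patches) become the tokens.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2)

images = torch.randn(2, 3, 224, 224)
features = conv_stem(images)                  # (2, 256, 28, 28) feature map
tokens = features.flatten(2).transpose(1, 2)  # (2, 784, 256): one token per spatial location
print(encoder(tokens).shape)                  # torch.Size([2, 784, 256])
```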

While transformers are not new to deep learning, their successful application to computer vision is. Transformers holding the SOTA on a vision benchmark is certainly a massive breakthrough, but it’s unclear whether they’ll be able to compete with convolutional networks in the (relatively) “low-data, low-compute” regime long term. Even more interesting than the results of any one paper is the potential for a convergence of NLP and CV around similar architectural components; if this trend continues, it could rapidly accelerate the progress of the field as a whole, as the DL community’s many niches and sub-disciplines begin to adopt similar techniques to solve wildly different problems.

Transformers have only been around for 4 years, but it’s clear that their impact on DL research will resonate for years to come.

www.clarifai.com
