Transformers for Vision
Moving from Disjoint AIs to the Universal Compute Engine…
GitHub Repository of our Exploration
When did this convergence happen?
If we look back about a decade, AI had many flavors: convolution-based methods for images, sequence models such as RNNs and LSTMs for text, and hybrids of the two for tasks like speech processing and image captioning. Today, essentially ALL state-of-the-art models use the Transformer architecture! The notion of self-attention was so powerful that its principles carried over regardless of the modality.
Vision is one such task. Convolutions exploit local features and learn salient representations of images during training, but they lack any global awareness or context of the image. What made things even tougher was that for a task like image captioning, we depended on the last layer of extracted features, which then had to be handed off to a separate sequence model to decode the text.
The convergence on the transformer architecture across modalities lets us perform shared attention computation, where image and text features can be meshed at every step of the architecture. To explore this further, we first wanted to learn about the Vision Transformer in general!
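To make the idea concrete, here is a minimal sketch of a Vision Transformer in PyTorch (our choice of framework; the image size, patch size, and dimensions below are illustrative assumptions, not values from any particular paper). The image is chopped into fixed-size patches, each patch is linearly embedded into a token, and standard self-attention layers are run over the resulting sequence:

```python
import torch
import torch.nn as nn

# Minimal ViT sketch: patchify the image, embed each patch as a token,
# prepend a learnable [CLS] token, add positional embeddings, and run
# transformer encoder layers (self-attention) over the whole sequence.
class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=384,
                 depth=6, heads=6, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution: one token per patch.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # x: (B, 3, H, W) -> (B, num_patches, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)      # global self-attention over all patches
        return self.head(tokens[:, 0])     # predict from the [CLS] token

model = SimpleViT()
logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 10)
```

Because every patch attends to every other patch from the very first layer, global context is available throughout the network, in contrast to the purely local receptive fields of early convolution layers.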
Self-Supervised Vision Training
Our main interest was to explore the state-of-the-art self-supervised Vision Transformer called DINO, which uses a teacher-student setup to automatically learn features from the dataset. A bonus of this approach is that the attention maps encoded in the ViT can be plotted and look almost like self-supervised segmentation! Even in the example above, there are plenty of objects surrounding my cat, yet the attention is focused squarely on him. This paradigm makes it easy to grab the ViT backbone from this model, slap on a decoder head for any task of interest (classification, object detection, segmentation), and get great performance on downstream tasks even with limited data!
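As a rough sketch of what that workflow could look like (assuming PyTorch and the official facebookresearch/dino torch.hub entrypoint; the 10-class linear head is our own illustrative addition, not part of DINO), you can pull the pretrained ViT-S/16 backbone, inspect its [CLS] attention maps, and bolt a small head on top of the frozen features:

```python
import torch

# Load the self-supervised DINO ViT-S/16 backbone from the official repo.
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
backbone.eval()

img = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image tensor

with torch.no_grad():
    # Self-attention of the [CLS] token in the last block: one map per head
    # over the 14x14 patch grid -- the near-segmentation shown above.
    attn = backbone.get_last_selfattention(img)    # (1, heads, 197, 197)
    cls_attn = attn[0, :, 0, 1:].reshape(-1, 14, 14)

    # Frozen features for a downstream task.
    feats = backbone(img)                          # (1, 384) CLS embedding

# Hypothetical downstream head: e.g. a 10-class linear classifier.
head = torch.nn.Linear(384, 10)
logits = head(feats)
```

Keeping the backbone frozen and training only the small head is what makes the limited-data downstream setup practical.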