Convolutional neural networks (CNNs) have dominated computer vision for nearly a decade. In this talk I will present two recent papers that propose new, highly competitive classes of architectures for computer vision. In the first part I will present the Vision Transformer (ViT), which is almost identical to the standard Transformer used in natural language processing, yet works surprisingly well for vision applications. In the second part I will present MLP-Mixer, an all-MLP architecture for vision. It can be seen as a simplified ViT without self-attention layers. Nevertheless, it also demonstrates strong results across a wide range of vision applications.