Better than the Transformer? BERT- and GPT-style models without attention or MLPs can actually be stronger.


Source: Heart of the Machine
This article is about 2,800 words; recommended reading time is 5 minutes.
This article covers Monarch Mixer (M2), a new architecture that is sub-quadratic in both sequence length and model dimension and is highly hardware-efficient on modern accelerators.

From language models such as BERT, GPT, and Flan-T5 to image models such as SAM and Stable Diffusion, Transformers are sweeping the world with unstoppable momentum, but people can’t help but ask: Is Transformer the only option?

A team of researchers from Stanford University and the State University of New York at Buffalo not only gives a negative answer to this question, but also proposes a new alternative: Monarch Mixer. The team recently published the paper on arXiv along with several model checkpoints and training code. The paper has also been accepted to NeurIPS 2023 as an oral presentation.


Paper address:

https://arxiv.org/abs/2310.12109

Code address:

https://github.com/HazyResearch/m2

The method removes the costly attention and MLP blocks of the Transformer and replaces them with expressive Monarch matrices, achieving better performance at lower cost in both language and image experiments.

This is not the first time Stanford researchers have proposed an alternative to the Transformer. In June this year, another team from the university proposed a technique called Backpack; see the article "Stanford Training Transformer Alternative Model: 170 Million Parameters, Debiased, Controllable and Highly Interpretable." Of course, for these techniques to truly succeed, they will need further testing by the research community and must be turned into practical, useful products in the hands of application developers.

Let’s take a look at the introduction to Monarch Mixer in this paper and some experimental results.

Paper Introduction

In natural language processing and computer vision, machine learning models now handle longer sequences and higher-dimensional representations, enabling longer context and higher quality. However, the time and space complexity of existing architectures grows quadratically in sequence length and/or model dimension, which limits context length and drives up scaling costs. For example, in the Transformer, attention scales quadratically with sequence length and the MLP scales quadratically with model dimension.

In response, the research team from Stanford University and the State University of New York at Buffalo says it has found a high-performance architecture whose complexity grows sub-quadratically in both sequence length and model dimension.

Their work was inspired by MLP-Mixer and ConvMixer; both observed that many machine learning models operate by mixing information along the sequence axis and the model-dimension axis, and that they often use a single operator for both axes.

Finding mixing operators that are expressive, sub-quadratic, and hardware-efficient is challenging. For example, the MLPs in MLP-Mixer and the convolutions in ConvMixer are expressive, but both scale quadratically with the input dimension. Several recent studies have proposed sub-quadratic sequence-mixing methods based on longer convolutions or state space models, all of which rely on the FFT; however, these models have poor FLOP utilization and still scale quadratically in the model dimension. Meanwhile, there is promising progress on sparsifying dense MLP layers without sacrificing quality, but some of these models can actually be slower than their dense counterparts because of low hardware utilization.

Based on these observations, the research team proposes Monarch Mixer (M2), which uses a class of expressive, sub-quadratic structured matrices: Monarch matrices.

Monarch matrices are a class of structured matrices that generalize the Fast Fourier Transform (FFT) and have been shown to cover a wide range of linear transforms, including Hadamard transforms, Toeplitz matrices, AFDF matrices, and convolutions. They are parameterized as products of block-diagonal matrices, called Monarch factors, interleaved with permutations.

Their computation is sub-quadratic: with p factors and input length N, the computational cost is O(pN^((p+1)/p)), which interpolates between O(N log N) when p = log N and O(N^(3/2)) when p = 2.
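To make that scaling concrete, here is a rough back-of-the-envelope comparison (ignoring constant factors) of multiply-accumulate counts for a dense N x N multiply versus Monarch multiplies at the 64k input length used in the benchmarks later in the article; the numbers are illustrative estimates, not measured FLOPs.

```python
import math

def monarch_macs(n: int, p: int = 2) -> float:
    """Rough multiply-accumulate count for an order-p Monarch multiply: ~ p * n^((p+1)/p)."""
    return p * n ** ((p + 1) / p)

n = 64 * 1024  # the 64k input length benchmarked later in the article
print(f"dense N x N multiply : {n ** 2:.3e} MACs")                              # ~4.3e9
print(f"order-2 Monarch (p=2): {monarch_macs(n, 2):.3e} MACs")                  # ~3.4e7, roughly 128x fewer
print(f"p = log N            : {monarch_macs(n, int(math.log2(n))):.3e} MACs")  # ~2.1e6, N log N scale
```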

M2 uses Monarch matrices to mix information along both the sequence axis and the model-dimension axis. The approach is not only easy to implement, it is also hardware-efficient: the block-diagonal Monarch factors can be computed efficiently on modern accelerators that support GEMM (general matrix multiplication).
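As an illustration of why the block-diagonal factors map well to GEMM, here is a minimal PyTorch sketch of an order-2 Monarch multiply: two batched block-diagonal matmuls interleaved with reshape/transpose permutations. This is a simplified sketch, not the authors' released implementation; the class name, parameterization, and initialization are assumptions.

```python
import torch
import torch.nn as nn

def blockdiag_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Multiply the last axis of x by a block-diagonal matrix whose blocks are
    stacked in w with shape (nblocks, block, block). The einsum is a batched
    GEMM, which is what makes each Monarch factor hardware-friendly."""
    nblocks, block = w.shape[0], w.shape[-1]
    x = x.reshape(*x.shape[:-1], nblocks, block)
    y = torch.einsum("bij,...bj->...bi", w, x)
    return y.reshape(*y.shape[:-2], nblocks * block)

class MonarchMatrix(nn.Module):
    """Order-2 Monarch matrix acting on vectors of length n = sqrt_n ** 2:
    two block-diagonal factors interleaved with 'transpose' permutations."""
    def __init__(self, sqrt_n: int):
        super().__init__()
        self.sqrt_n = sqrt_n
        self.L = nn.Parameter(torch.randn(sqrt_n, sqrt_n, sqrt_n) / sqrt_n)
        self.R = nn.Parameter(torch.randn(sqrt_n, sqrt_n, sqrt_n) / sqrt_n)

    def permute(self, x: torch.Tensor) -> torch.Tensor:
        # view the length-n axis as an (sqrt_n, sqrt_n) grid and swap the two factors
        m = self.sqrt_n
        return x.reshape(*x.shape[:-1], m, m).transpose(-1, -2).reshape(*x.shape)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., sqrt_n ** 2)
        x = blockdiag_matmul(self.permute(x), self.L)
        x = blockdiag_matmul(self.permute(x), self.R)
        return self.permute(x)
```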

[Figure 1: Overview of Monarch Mixer, with PyTorch pseudocode for the M2 layer in the middle.]

The research team implemented an M2 layer as a proof of concept: written entirely in PyTorch in fewer than 40 lines of code (including imports), it relies only on matrix multiplication, transpose, reshape, and elementwise products (see the pseudocode in the middle of Figure 1). This code achieves 25.6% FLOP utilization on an A100 GPU for inputs of size 64k. On newer architectures such as the RTX 4090, a simple CUDA implementation reaches 41.4% FLOP utilization for the same input size.
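In the same spirit, the sketch below combines such Monarch matrices into an M2-style mixer layer that mixes along the sequence axis and then the model-dimension axis, using only matrix multiplications, elementwise products, transposes, and reshapes. It reuses MonarchMatrix from the previous sketch; the kernel shapes, activation, and residual/normalization choices are simplifying assumptions, not the paper's exact layer.

```python
# Reuses torch, nn, and MonarchMatrix from the sketch above.
class M2LayerSketch(nn.Module):
    """Mix along the sequence axis, then along the model dimension, using only
    Monarch multiplies, elementwise products, transposes, and reshapes."""
    def __init__(self, sqrt_n: int, sqrt_d: int):
        super().__init__()
        n, d = sqrt_n ** 2, sqrt_d ** 2
        self.seq_m1, self.seq_m2 = MonarchMatrix(sqrt_n), MonarchMatrix(sqrt_n)
        self.dim_m1, self.dim_m2 = MonarchMatrix(sqrt_d), MonarchMatrix(sqrt_d)
        self.seq_kernel = nn.Parameter(torch.randn(d, n) * 0.02)  # per-channel kernel over the sequence
        self.dim_kernel = nn.Parameter(torch.randn(1, d) * 0.02)
        self.norm = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n, d)
        # sequence mixing: Monarch -> elementwise kernel -> Monarch
        # (a convolution when the Monarch matrices are fixed to the DFT / inverse DFT)
        x_seq = self.seq_m2(self.seq_kernel * self.seq_m1(x.transpose(-1, -2))).transpose(-1, -2)
        # dimension mixing: plays the role of the MLP
        y = self.dim_m2(torch.relu(self.dim_kernel * self.dim_m1(x_seq)))
        return self.norm(y + x_seq)

# usage: batch of 8 sequences, length 1024 (= 32**2), width 256 (= 16**2)
layer = M2LayerSketch(sqrt_n=32, sqrt_d=16)
out = layer(torch.randn(8, 1024, 256))  # -> (8, 1024, 256)
```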


For more mathematical description and theoretical analysis of Monarch Mixer, please refer to the original paper.

Experiments

The research team compared Monarch Mixer and the Transformer on three tasks where the Transformer has long dominated: BERT-style non-causal masked language modeling, ViT-style image classification, and GPT-style causal language modeling.

On each task, the experimental results show that the new method reaches a level comparable to the Transformer without using attention or MLPs. The team also evaluated the new method's speedup over a strong Transformer baseline in the BERT setting.

Non-Causal Language Modeling

For non-causal language modeling, the team built an M2-based architecture, M2-BERT, which can serve as a drop-in replacement for BERT-style language models; BERT is a flagship application of the Transformer architecture. M2-BERT is trained with masked language modeling on C4 using the bert-base-uncased tokenizer.
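For illustration, the snippet below shows one standard way to build masked-language-modeling batches from C4 with the bert-base-uncased tokenizer using Hugging Face libraries; the dataset identifier, sequence length, and 15% masking probability are common defaults assumed here, not settings confirmed by the paper.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 15% masking is the usual BERT default; the paper's exact setting may differ.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Stream C4 so nothing has to be downloaded up front ("allenai/c4" is the current hub id).
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
texts = [example["text"] for _, example in zip(range(8), c4)]

encodings = tokenizer(texts, truncation=True, max_length=128)
features = [{"input_ids": ids} for ids in encodings["input_ids"]]
batch = collator(features)  # padded input_ids with [MASK] tokens, plus MLM labels
print(batch["input_ids"].shape, batch["labels"].shape)
```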

M2-BERT is based on the Transformer backbone, but its attention layers and MLPs are replaced by M2 layers, as shown in Figure 3.

[Figure 3: The M2-BERT architecture, with M2 layers in place of attention and MLPs.]

In the sequence mixer, attention is replaced by a bidirectional gated convolution with a residual convolution (see Figure 3, left). To recover convolution, the team sets the Monarch matrices to the DFT and inverse DFT matrices. They also add depthwise convolutions after the projection step.
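The sketch below illustrates what such a gated-convolution sequence mixer can look like, with an FFT-based circular convolution standing in for the DFT/inverse-DFT Monarch matrices; the module structure, filter parameterization, and projection sizes are hypothetical and not the released M2-BERT code.

```python
import torch
import torch.nn as nn

def fft_circular_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Circular convolution along the last axis via FFT, i.e. what the sequence
    mixer computes when its Monarch matrices are fixed to the DFT / inverse DFT."""
    n = u.shape[-1]
    return torch.fft.irfft(torch.fft.rfft(u, n=n) * torch.fft.rfft(k, n=n), n=n)

class GatedConvMixerSketch(nn.Module):
    def __init__(self, d_model: int, seq_len: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)  # project to (conv branch, gate branch)
        self.short_conv = nn.Conv1d(2 * d_model, 2 * d_model,  # depthwise conv after the projection
                                    kernel_size=3, padding=1, groups=2 * d_model)
        self.long_filter = nn.Parameter(torch.randn(d_model, seq_len) * 0.02)  # per-channel long filter
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, d_model)
        xv = self.in_proj(x)
        xv = self.short_conv(xv.transpose(1, 2)).transpose(1, 2)
        u, gate = xv.chunk(2, dim=-1)
        y = fft_circular_conv(u.transpose(1, 2), self.long_filter).transpose(1, 2)  # long conv via FFT
        return self.out_proj(y * gate)  # gate with an elementwise product, then project out
```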

In the dimension mixer, the two dense matrices of the MLP are replaced by learned block-diagonal matrices (order-1 Monarch matrices with b = 4 blocks).
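A minimal sketch of such a block-diagonal MLP replacement with b = 4 blocks is shown below; the expansion factor, activation, and initialization are assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def blockdiag_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Apply a block-diagonal weight w of shape (nblocks, out_block, in_block)."""
    nblocks, in_block = w.shape[0], w.shape[-1]
    x = x.reshape(*x.shape[:-1], nblocks, in_block)
    y = torch.einsum("boi,...bi->...bo", w, x)
    return y.reshape(*y.shape[:-2], -1)

class BlockDiagonalMLP(nn.Module):
    """MLP replacement where both dense matrices are block-diagonal (order-1 Monarch)."""
    def __init__(self, d_model: int, expansion: int = 4, nblocks: int = 4):
        super().__init__()
        d_hidden = expansion * d_model
        self.w1 = nn.Parameter(torch.randn(nblocks, d_hidden // nblocks, d_model // nblocks) * 0.02)
        self.w2 = nn.Parameter(torch.randn(nblocks, d_model // nblocks, d_hidden // nblocks) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., d_model)
        return blockdiag_linear(F.gelu(blockdiag_linear(x, self.w1)), self.w2)

# with nblocks = 4, each weight holds 1/4 of the parameters of its dense counterpart
mlp = BlockDiagonalMLP(d_model=768)
print(mlp(torch.randn(2, 128, 768)).shape)  # torch.Size([2, 128, 768])
```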

The researchers pre-trained four M2-BERT models: two M2-BERT-base models with 80M and 110M parameters, and two M2-BERT-large models with 260M and 341M parameters, serving as counterparts to BERT-base and BERT-large respectively.

Table 3 shows the performance of the model equivalent to BERT-base, and Table 4 shows the performance of the model equivalent to BERT-large.

[Tables 3 and 4: GLUE results for the BERT-base-equivalent and BERT-large-equivalent models.]

As the tables show, on the GLUE benchmark M2-BERT-base matches BERT-base while using 27% fewer parameters; at equal parameter counts, M2-BERT-base outperforms BERT-base by 1.3 points. Similarly, M2-BERT-large matches BERT-large with 24% fewer parameters, and at equal parameter counts it holds a 0.7-point advantage.

Table 5 shows the forward throughput of the BERT-base-equivalent models, reported as tokens processed per millisecond on an A100-40GB GPU; this reflects inference time.

[Table 5: Forward throughput (tokens per millisecond) on an A100-40GB GPU.]

As can be seen, the throughput of M2-BERT-base exceeds even that of the highly optimized BERT model; at a sequence length of 4k, its throughput reaches 9.1 times that of the standard HuggingFace implementation.

Table 6 reports CPU inference times for M2-BERT-base (80M) and BERT-base, obtained by directly running the PyTorch implementations of the two models.

[Table 6: CPU inference times for M2-BERT-base (80M) and BERT-base.]

For shorter sequences, data-locality effects still dominate the FLOP reduction, and operations such as filter generation (absent from BERT) are relatively expensive. Once the sequence length exceeds 1K, M2-BERT-base's speed advantage grows steadily, reaching 6.5 times at a sequence length of 8K.

Image Classification

Staying with non-causal modeling, to verify that the new method offers the same advantages on images as on language, the team also evaluated M2 on image classification.

Table 7 shows the performance of Monarch Mixer, ViT-b, HyenaViT-b, and ViT-b-Monarch (standard ViT-b with its MLP modules replaced by Monarch matrices) on ImageNet-1k.

[Table 7: ImageNet-1k results for Monarch Mixer, ViT-b, HyenaViT-b, and ViT-b-Monarch.]

Monarch Mixer's advantage is clear: with half the parameters, it outperforms the original ViT-b model. Even more surprisingly, Monarch Mixer also outperforms ResNet-152 with fewer parameters, even though ResNet-152 was designed specifically for the ImageNet task.

Causal Language Modeling

GPT-style causal language modeling is a key application of Transformer. The team built an M2-based architecture for causal language modeling: M2-GPT.

For the sequence mixer, M2-GPT combines the convolutional filters from Hyena, the current state of the art in attention-free language models, with cross-multihead parameter sharing from H3. They replace the FFT in these architectures with a causal parameterization and remove the MLP layer entirely. The resulting architecture has no attention and no MLPs at all.
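One concrete requirement of a causal parameterization is that the long convolution must not let any position see the future; with an FFT this can be achieved by zero-padding to length 2N and keeping only the first N outputs. The sketch below shows that primitive under these assumptions; it is not the authors' Monarch-based causal parameterization.

```python
import torch

def causal_fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Causal (linear, not circular) convolution via FFT: zero-pad to length 2N
    so that no output position depends on future inputs, then keep the first N."""
    n = u.shape[-1]
    u_f = torch.fft.rfft(u, n=2 * n)
    k_f = torch.fft.rfft(k, n=2 * n)
    return torch.fft.irfft(u_f * k_f, n=2 * n)[..., :n]

# usage: 2 sequences, 8 channels, length 512, one learned filter per channel
y = causal_fft_conv(torch.randn(2, 8, 512), torch.randn(8, 512))  # -> (2, 8, 512)
```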

They pre-trained M2-GPT on the Pile, a standard dataset for causal language modeling. The results are shown in Table 8.

[Table 8: Pre-training perplexity on the Pile.]

Although the new architecture has no attention and no MLPs at all, it still outperforms both the Transformer and Hyena in pre-training perplexity. These results suggest that models radically different from the Transformer can also achieve excellent performance on causal language modeling.

For more information, please refer to the original paper.

Editor: Wen Jing
