Musk’s xAI releases its first research results! How do you train a ResNet to unlimited depth?


Xifeng and Yuyang, reporting from Aofei Temple
Reprinted from: Qubit (QbitAI)

Musk’s xAI has released its first public research results!

One of the co-authors is Greg Yang, a founding member of xAI and a student of Shing-Tung Yau.

Yang has previously stated publicly that his research direction at xAI is “Math for AI” and “AI for Math”.

A key part of this is continuing his earlier line of research:

Tensor Programs, a unified programming language for describing neural network architectures, whose results have already been applied in GPT-4.

The new paper belongs to this series and focuses on how to train infinitely deep networks.


Yang also held a livestream to discuss the work.

Let’s take a look at the highlights worth noting~


Training infinitely deep neural networks

Simply put, the paper studies how to extend residual networks (ResNets) in the depth direction.

Residual networks famously solved the degradation problem that plagued deep convolutional networks as depth increased. But as networks continue to deepen, training a good deep residual network is still not easy:

As the network deepens, the magnitudes of the features keep growing, destabilizing training; and every time the network is deepened, the hyperparameters have to be retuned, which takes a lot of work…
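
To get a feel for the first problem, here is a toy PyTorch sketch (our own illustration, not code from the paper): with standard initialization and unscaled residual branches, the feature norm grows rapidly with depth.

```python
# Toy illustration (not from the paper): with standard initialization and
# unscaled residual branches, feature magnitudes grow with depth, which is
# exactly the instability described above.
import math
import torch

torch.manual_seed(0)
width, depth = 256, 100
x = torch.randn(width)
for _ in range(depth):
    w = torch.randn(width, width) / math.sqrt(width)  # standard 1/sqrt(width) init
    x = x + torch.relu(w @ x)  # unscaled residual update: adds a term as big as x
print(f"feature norm after {depth} blocks: {x.norm().item():.2e}")  # blows up
```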

The idea of Yang and his collaborators is to find a depthwise parameterization that can both learn features and achieve hyperparameter transfer.

They started from the two limiting cases of infinitely wide neural networks: such networks behave either as kernel machines or as feature learners. In the latter case, the optimal hyperparameters do not change with width.


Here, they use the Tensor Programs framework to analyze the limiting case of infinitely wide networks.

As mentioned earlier, Tensor Programs is a long-term research goal of Yang’s: using mathematical language to build a low-level programming language that can describe and analyze neural network architectures.


Specifically, Tensor Programs consist of matrix multiplications and activation functions. Yang found that if a neural network’s computation can be expressed in this language, its behavior at initialization can be analyzed automatically and completely.
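
As a rough illustration (our simplification, not the paper’s formalism), a forward pass can be written as a straight-line program whose only instruction types are matrix multiplication and coordinate-wise nonlinearity:

```python
# Rough sketch of the Tensor Programs view: a forward pass as a straight-line
# program built only from MatMul and coordinate-wise Nonlin instructions.
# The framework then characterizes the width -> infinity limit of such programs.
import math
import torch

torch.manual_seed(0)
n = 4096  # width; Tensor Programs analyze behavior as n -> infinity
W1 = torch.randn(n, n) / math.sqrt(n)
W2 = torch.randn(n, n) / math.sqrt(n)
x = torch.randn(n)

h1 = W1 @ x          # instruction: MatMul
a1 = torch.tanh(h1)  # instruction: Nonlin (coordinate-wise)
h2 = W2 @ a1         # instruction: MatMul
a2 = torch.tanh(h2)  # instruction: Nonlin
# Scalar statistics of such programs, e.g. the mean square of the coordinates,
# converge to deterministic, mechanically computable limits at initialization.
print((a2 * a2).mean().item())
```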

We won’t go through the mathematical derivations in detail here.

Building on these analyses, the authors propose Depth-μP, a method that achieves hyperparameter transfer in the depth direction and greatly simplifies hyperparameter tuning across different depths.

Depth-μP includes the following key points:

  • Each residual branch is scaled by a coefficient a/√L, inversely proportional to the square root of the depth L (the number of residual blocks).

  • The learning rate of each weight matrix is scaled with depth L in a way that depends on the optimizer: for SGD the learning rate stays a constant η, while for adaptive optimizers such as Adam it becomes η/√L (see the sketch below).
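
Here is a minimal PyTorch sketch of these two rules, under our own simplifying assumptions (depth-1 linear residual blocks; the constants a and η are placeholder tunable values, and this is not the authors’ code):

```python
# Minimal sketch of Depth-muP-style scaling (our reading of the two rules
# above, not the authors' implementation).
import math
import torch
import torch.nn as nn

class DepthMuPResNet(nn.Module):
    def __init__(self, width: int, depth: int, a: float = 1.0):
        super().__init__()
        self.depth = depth
        self.a = a
        # Depth-1 residual blocks: one weight matrix per branch.
        self.blocks = nn.ModuleList(nn.Linear(width, width, bias=False)
                                    for _ in range(depth))

    def forward(self, x):
        scale = self.a / math.sqrt(self.depth)    # branch coefficient a/sqrt(L)
        for block in self.blocks:
            x = x + scale * torch.relu(block(x))  # scaled residual update
        return x

width, depth, eta = 256, 64, 1e-3
model = DepthMuPResNet(width, depth)
# Learning-rate rule: constant eta for SGD, eta/sqrt(L) for Adam-style optimizers.
sgd  = torch.optim.SGD(model.parameters(), lr=eta)
adam = torch.optim.Adam(model.parameters(), lr=eta / math.sqrt(depth))
```

With this scaling, the paper’s claim is that hyperparameters tuned on a shallower model can be transferred to a deeper one without retuning.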

Notably, the authors found that when each residual block has depth 1 (a single layer per block), Depth-μP is the optimal depth parameterization: it ensures that the optimal hyperparameters converge as depth increases, realizing hyperparameter transfer in the depth direction.


However, when each residual block has depth ≥ 2, hyperparameter transfer still fails and training performance degrades.


In addition, the paper explores the concept of “feature diversity,” arguing that it plays a key role in deep networks.

The paper’s other co-author is Dingli Yu, a graduate of Tsinghua University’s Yao Class who is now pursuing a Ph.D. in computer science at Princeton.

What did Greg Yang say during the livestream?

During the livestream, Yang answered questions from the audience. Qubit has compiled some of them below, preserving the original meaning.

Q: For many of us, the paper’s content may be beyond our understanding. But I’d like to know: how is the model you describe different from the ChatGPT and OpenAI technologies we can already experience? What are the significant differences or innovations between this paper and OpenAI’s results?

Greg Yang: Let me comment briefly. These results are not directly tied to practical applications at present; they are more research-oriented.

Of course, the ultimate goal of all this is to make models better and safer, and in turn to benefit humanity. What we are doing now is characterizing the expected behavior, which does not necessarily have a direct impact.

We’re all in the same boat now, and we’re doing what we can, whether it’s short-term work or long-term applied research, to make it work for everyone.

Q: It sounds like you are building an artificial computer brain capable of reasoning; is that what you are working on? Also, I am a mother and my 7-year-old son is very interested in mathematics. Do you have any advice on how to keep him interested and enthusiastic about the field of AI?

Greg Yang: “Neural networks” here means artificial neural networks. I think they are the backbone of many modern technologies, including the Google, Facebook, and Instagram services you use every day; under the hood, all of these services run on artificial neural networks. These networks were inspired by the real neural networks of animals and humans some sixty or seventy years ago, but they have since diverged from real neuroscience.

These networks are inherently mathematical objects, so once we master the new mathematical problems they pose, extensive analysis gives us a deeper understanding of them.

Although we don’t yet know how neurons are actually connected, mathematical research lets us optimize these artificial neural networks and help technology companies improve people’s lives.

Regarding your second question, it’s great to hear that your son is very interested in math. This is the foundation for creating great things in technology and improving everyone’s lives.

My first piece of advice is to maintain your son’s passion for math; once that love is lost, it becomes hard to keep learning.

Also, pay attention to the things he likes, to make learning interesting and further stimulate his interest. At the same time, cultivate his curiosity about how things work and help him develop a scientific mindset, doing research driven by curiosity, like taking things apart and trying to understand how they work.

If a person loses their passion for exploring the mathematical truths of the universe, it is hard to stay motivated. Overall, I recommend nurturing in your son a strong interest in and curiosity about the world, especially about the nature of mathematics and science.

Q: I have a more abstract question. You had the idea of depth approaching infinity, and you wrote this paper based on it. Have you considered using different neural network architectures? Not the standard architecture with neurons and countless layers, but something completely different, where the neurons are connected in a completely different way, maybe some kind of square shape?

Greg Yang: In fact, the insights into nonlinearity and the number of layers in our work are only very preliminary. There are certainly many open questions about what an appropriate structure is, or what a structure should be.

The Meta team has previously studied what happens when neurons are connected randomly and obtained some interesting results. So there’s definitely a lot to do here. Right now I don’t have a concrete answer as to what a correct or better structure would be.

About Greg Yang

Greg Yang was born in Hunan Province. After finishing elementary school he moved to the United States, and later studied under Professor Shing-Tung Yau at Harvard.


Greg Yang with Shing-Tung Yau. Image source: Greg Yang’s Twitter

In 2017, Yang graduated from Harvard and later joined Microsoft on the recommendation of Harry Shum (Shen Xiangyang).

At Microsoft, Yang earned high praise from Shum. A few months ago, at a forum called “Basic Science and Artificial Intelligence”, Shum stated publicly:

Microsoft Research usually only hires Ph.D. graduates, but Greg Yang joined with just an undergraduate degree. Not only did he join Microsoft Research, he also performed exceptionally well over the past five years, making a decisive contribution to the development of GPT in particular.

It is worth mentioning that Yang himself has confirmed that GPT-4 used his μTransfer method (part of the Tensor Programs series).

Yang’s research on Tensor Programs started early: he published “Tensor Programs I” in 2019 and continued exploring the topic in depth while at Microsoft. He believes that almost any computation in deep learning can be expressed as a Tensor Program.

In July this year, Musk announced the founding of a new company, xAI. Yang left Microsoft to join xAI’s founding team as a mathematician.

Since joining xAI, Yang has said more than once that the long-term goal of the Tensor Programs project is to develop a “theory of everything” for large-scale deep learning, that is, to find theoretical rules that let us truly understand the behavior of large AI models.

He also said:

AI will enable everyone to understand our mathematical universe in ways previously unimaginable.

Paper link: https://arxiv.org/abs/2310.02244
