NumPy implements GPT's decoder to generate classical Chinese poems


https://github.com/ZouJiu1/numpy_transformer/blob/master/gpt/gpt_train_potry3000.py

It mainly reuses the neural-network layers previously written in NumPy (numpy_transformer/net: https://github.com/ZouJiu1/numpy_transformer/tree/master/net), including the fully connected, softmax, embedding, attention, and decoder layers; see the articles below for details. Training accuracy can reach 96%.

Related articles on Zhihu (zhihu.com):

numpy implements forward and backward propagation of the embedding layer

forward and backward propagation of the loss functions

forward and backward propagation of the fully connected layer

numpy implements forward and backward propagation of layernorm

numpy implements forward and backward propagation of the multi-head attention layer

forward and backward propagation of the BCE and BCEWithLogits loss functions

the mask used by attention in the transformer network

Data

The data is from:

Werneror/Poetry (github.com): a very comprehensive collection of classical Chinese poetry, more than 850,000 poems in total, from the pre-Qin period to modern times. https://github.com/Werneror/Poetry

hanlp was used to segment the poems into words; the word frequencies were then counted and sorted in descending order. Keeping the 3,000 most frequent words, all poems were traversed, poems containing any other word were discarded, and a total of 6,000 lines of poems were retained as train_3000.txt, which is used for training. Likewise, keeping the 6,000 most frequent words and removing poems containing other words retains a total of 30,000 lines of poems as train_6000.txt.

This article uses train_3000.txt, i.e., the poems covered by the 3,000 most frequent words. The program used for word segmentation is numpy_lstm_RNN/tokenlzh.py: https://github.com/ZouJiu1/numpy_lstm_RNN/blob/master/tokenlzh.py
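The filtering step itself is simple; here is a minimal sketch, assuming the poems are already segmented into words (the helper name and input format are illustrative, not the repo's actual tokenlzh.py):

from collections import Counter

def filter_poems(segmented_poems, vocab_size=3000):
    # segmented_poems: list of poems, each a list of words (e.g. produced by hanlp)
    freq = Counter(w for poem in segmented_poems for w in poem)
    vocab = {w for w, _ in freq.most_common(vocab_size)}
    # keep only poems whose every word falls inside the high-frequency vocabulary
    return [poem for poem in segmented_poems if all(w in vocab for w in poem)]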

decoder

GPT's decoder begins with position embedding plus word embedding. The position embedding uses fixed cosine (sinusoidal) values, while the word embedding is a regular embedding layer; both are implemented in the Position_Embedding layer in PatchEmbed.py.
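For reference, a fixed sinusoidal table can be built in NumPy roughly as follows (a sketch of the standard transformer formulation; the function name is illustrative and the repo's Position_Embedding may differ in detail):

import numpy as np

def sinusoid_position_table(context_length, embed_dim):
    # angle(pos, i) = pos / 10000^(2*(i//2) / embed_dim)
    pos = np.arange(context_length)[:, None]
    i = np.arange(embed_dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / embed_dim)
    table = np.zeros((context_length, embed_dim))
    table[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions: sin
    table[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions: cos
    return table  # (context_length, embed_dim); added to the word embeddings, not learned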

Next come the stacked decoder layers (attdecoderblock_layer). The backward pass of softmax needs special attention: its derivative is a Jacobian matrix, so the upstream gradient from the loss must be multiplied by this matrix rather than simply accumulated elementwise. The forward and backward propagation of the decoder are implemented in attdecoderblock.py.
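Concretely, for one row y = softmax(x) the Jacobian is diag(y) − y·yᵀ, and the gradient flowing back is that matrix multiplied by the upstream gradient. The matrix product contracts to a cheap elementwise form; a small NumPy sketch (illustrative, not the repo's exact code):

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(y, grad_out):
    # per row: grad_in = (diag(y) - y y^T) @ grad_out
    #                  = y * (grad_out - sum(grad_out * y))
    s = np.sum(grad_out * y, axis=-1, keepdims=True)
    return y * (grad_out - s)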

train

python gpt_train_potry3000.py

Generated poems

run

python gpt_predict_poetrythree.py

The poems below were generated with the trained model gpt_poetry3000_iters1999_1_loss_3259.634242.pkl and the script gpt_predict_poetrythree.py.
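For orientation, prediction amounts to an autoregressive loop of the kind sketched below; the layer objects and forward signatures mirror the training program later in this article, but the generate helper itself is hypothetical:

def generate(layers, char2id, id2char, prompt, steps=100, context_length=100):
    ids = [char2id[c] for c in prompt]
    for _ in range(steps):
        inputs = np.array([ids[-context_length:]])   # keep at most context_length tokens
        mask = create_masks_future(inputs)
        x = inputs
        for layer in layers:
            if isinstance(layer, attdecoderblock_layer):
                x = layer.forward(x, mask)
            else:
                x = layer.forward(x)
        next_id = int(np.argmax(x[0, -1]))  # greedy decoding; sampling also works
        ids.append(next_id)
    return "".join(id2char[i] for i in ids)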

My concubine is like a flower by the river, and you are like the water on the river. The flowers fall with the water, but the east wind cannot blow them. My concubine’s family was in the east of Hengtang, and I met Lang Zha. There is no need to ask when a man comes, because parasol trees are planted outside the door. The day is quiet and warm, the breeze is gentle, and the curtains are drawn down to the few guests. The two swallows painted on the beams do not dare to fly near others. The water holds the isolated village far away, and the mountains are connected together

Go to the restaurant where you are worried. The dangerous bridge is an ancient temple, and I enjoy being with the monks while I am relaxing. The autumn rain falls in Jipu, and the boat lights up at night. The wind sinks and the words are far away; the tide rises and the moon rises. Everything is empty and it seems that it is impossible to concentrate on it. The lanterns passed by among the pines at dusk, and Gu Ying was at the end of the world. Cranes are lost in the dim snow, and flowers bloom early due to the cold spring. Difficulties make you understand the taste of the world, and poverty, illness, and years are a nuisance. motherland

The horse flies like dust. When the leaves are thick, you will know the denseness of the willows; when the flowers are gone, you will know the sparse plum blossoms. Lan Sheng cannot be grasped, and Pu Xiao cannot be written. The plum blossoms contain this spring tree, and they are still near the Xianri Pond. People remember the past years when they are pregnant,
The flowers bloom on old branches. The pool has been closed for a long time, and the forest flowers are slightly thicker. The wind blows into the flowers and branches, and the sunlight floats on the water. The moon in the tall building in the empty courtyard is no longer three or five circles.

The smoke is rising straight from the garrison, and the sun is setting late on the flat sand. The dew is cold, the gold palm is heavy, and the sky is close to the jade rope, which is low. The frightened cicadas move away from the ancient willows, and the fighting birds fall into the cold courtyard. Crane's message across the sea, monk and white cloud's poem.
The birds are dark, the wind is sinking, the sky is clear and the moon is rising. I didn't know my surname for many years, but I moved around in a few days. The noisy wind produces the end of the trees, and the late scenery enters the heart of the spring. The Xiangyun clouds follow the wild geese,

It's like a traveling sickness. Baoqin conveys this meaning, what year does Qiyue plan? At dusk, I lean on the streamer and listen to the song, feeling sorry for myself. Shi Shi tended to clear the ban, and Cheng En went to Zhilu. I can get wine on credit for the remainder of the bottle,
She accumulated books and sent them books. What's the point of carving phoenixes? The habit of carving insects has not been eliminated. The reason is that there are few things in the world, so what is the meaning of loneliness? The brocade stone is covered with cold beauty, and the autumn light is full of guest sentiment. color increase

One day, he finally changed who came first, and talked about pity for him and the money for drinks. Don't worry about wealth and wealth. Heaven doesn't care about it. It doesn't care about hills and valleys. When the rain rises, the pond is as green as the Huaihe River, and half of the spring water fills the eyes.
If you don't have money, you have to buy a boat. Don't worry about the past and the present. Thousands of families look at each other but can't understand each other, and they know that spring is in the sound of rain. The donkey is about to go away, worried about the slippery mud, and flies six feet to the west.

Who has been lonely for ten years to see this branch? The flowers are blooming but no one knows yet. I sing under the flowers and think about myself. If the flowers could talk, they would laugh at me. They would only write poems every year without wine. The boat sailed from Xishan late and wanted to go north.
The wind blows me back home. It's not easy to come every ten years. Watch the rice flowers in the independent plain. The newly opened bamboo paths are stored for many autumns, and I have seen them every time when I brought wine with me. The moon has not yet risen

Go to the restaurant where you are worried. The dangerous bridge is an ancient temple, and I enjoy being with the monks while I am relaxing. The autumn rain falls in Jipu, and the boat lights up at night. The wind sinks and the words are far away; the tide rises and the moon rises. Thinking about everything in vain,
It always fails. The lanterns passed by among the pines at dusk, and Gu Ying was at the end of the world. Cranes are lost in the dim snow, and flowers bloom early due to the cold spring. Difficulties make you understand the taste of the world, and poverty, illness, and years are a nuisance. motherland

Suddenly Sun and Sun were studying together. The clouds are falling and the autumn rain is falling, and the rooster sends the dawn window light. The door serves as a carriageway, and the curtain separates fame and dust. The clouds spread thousands of miles away, and the wind moves the stars. The mountains are green and rainy,
Green floating near the city smoke. The moonlight is good all the time, but people's hearts are biased this night. The spring water is beautiful, and the wild clouds have no vulgar appearance. When the source comes from its own source, life will be peaceful and harmonious

Return to weakness. When I asked about the situation in Mokong Mountain, I called in the poet's case. The clouds are about to open but don't. I ask the sky to find a good wind to urge me. There is not much rain, the mud is slippery, and the banks of the stream are so deep that they are slumped.
Add all the red stove and wear all the clothes, and you will feel as warm as a cup of tea. People say that the cold behind the frost is helpless, but the spring is in the urn and the channel is unknown. The plum blossoms are not too sparse and the apricots are too numerous. What is the master?

It's going to rain. Sorry, I am pregnant with someone, but my Taoism is blocked. I care about this area and caress it again. Good times pass by, but they pass away without me. Meaning is not as good as righteousness, and righteousness is not as good as benefit. Benefit brings people,
Can forget life and death. Benefit is not as good as righteousness, and righteousness is not as good as intention. The intention makes people move the world. Stay in darkness and watch the light, stay still and watch the movement. Live in simplicity and view complexity, live in lightness

Longer candles and more flowers. Watch the bathing rabbit in one part, and listen to the croaking frog in the other part. If you can sincerely gain your original aspiration, why bother returning to your original aspiration? There is no need to guard against being expelled from the outside. There are few people arriving at the wild post station.
The empty garden grass grows on its own. I haven't noticed the clear frost, and the drizzle is even more sunny. The old house is broken by the wind, and the empty forest is dry after the rain. The falling yellow leaves are full, the lonely wild fragrance remains

The moon is in the west of Hualou. My concubine is like a flower by the river, and you are like the water on the river. The flowers fall with the water, but the east wind cannot blow them. My concubine's family was in the east of Hengtang, and I met Lang Zha. Lang Lai doesn’t need to ask,
Plane trees are planted outside the door. The day is quiet and warm, the breeze is gentle, and the curtains are drawn down to the few guests. The two swallows painted on the beams do not dare to fly near others. The water holds the isolated village far away, and the mountains are connected together

It's after dusk on Heshan Mountain. When the waves in the courtyard begin to widen, the water in the gorge is always cloudy. The dream of the soul is still hard to come, and the gray hair of the king is invaded by sorrow. The water is as calm as the trees, but the beach is as restless as the wind. The sound of heartbreak for hundreds of miles,
Back then, wanderers listened. There is no going back, this trip is as safe as you want. The heart of flowing water through the ages is still in loneliness. It is difficult to violate the nature of sparseness, and the moss wilderness and alleys are deep. Huang Yeyu arrives at the door

Clouds appear everywhere. After the falling snow and frost, there is still a thousand years of spring. A fossil plant, who is the person who planted it? The Buddha's mind can be seen everywhere, emerging more and more clearly. No need to turn on the lights,
The sky is high and the moon is self-generated. The universe is numbered five, and the sun and the moon are in harmony. But it is difficult to describe the meritorious service. The flags and poles by the water are dangerous, and the talismans are divided into different graces.

There is nothing you can do if you look up to the sky. A cup of turbid wine will leave me with unresolved worries, and a pot of spittle will break into pieces without making a song. The fragrance of mignonette penetrates the clouds across the mountains, remembering that the roots are separated from the sea. Hate to kill the evil west wind that comes at night,
A branch destroys the place and worries you. God's will is shallow and profound for people, but how can people tolerate God's will? One line and one stop only lasts for a moment, and this way is as majestic as it has been since ancient times. Alchemy Sword Guilian

On the mirror sun. I remember the years before I was pregnant, and the flowers and hair are old branches. The pool has been closed for a long time, and the forest flowers are slightly thicker. The wind blows into the flowers and branches, and the sunlight floats on the water. The sky is full of moon in the empty courtyard,
It’s not three or five circles. Why bother looking at the bed when you will end up sleeping alone. Don't complain about the sound of mournful songs, and leave the dance clothes wet with cries. May the false bird live in a song and fly from the south. Sanzhou Duanjiangkou

The moon rises and autumn comes to drunkenness, and the sound of empty fasting falls at night. I was shocked by last night's dream across the bed, hiding a few words in my life. The Deng Jing book is still being read, and the sentence about selling fragrance suddenly comes into being. When we look at each other for years, I also use my affection.
I admire Wu Songqu and come to seek alliance with Du Leng. Mistakenly hearing the rain on the bed, it was called the sound of the awning hitting. The leakage is slow and the drops are replaced, and spring is born from the water station. The dawn clouds drive away the shadows, I want to take advantage of the new sunshine

program

import os
abspath = os.path.abspath(__file__)
filename = os.sep.join(abspath.split(os.sep)[-2:])
abspath = abspath.replace(filename, "")
import sys
sys.path.append(abspath)

from net.loss import cross_entropy_loss
import numpy as np
import pickle
from net.layernorm import layer_norm
from PatchEmbed import Position_Embedding
from attdecoderblock import attdecoderblock_layer
from net.fullconnect import fclayer
from gpt.gpt_linear import gpt_linear_layer
import re
from classify import classify_layer
from net.flatten import flatten_layer

from copy import deepcopy
import json


def getdata():
    dataset = os.path.join(abspath, 'dataset')
    os.makedirs(dataset, exist_ok=True)
    id2char_char2id = os.path.join(abspath, 'dataset', r"gptpoetry3000.json")
    # inpath = os.path.join(abspath, 'dataset', r"train_10000.txt")
    
    inpath = r'C:\Users\10696\Desktop\access\numpy_transformer\dataset\train_3000.txt'
    with open(inpath, 'r', encoding='utf-8') as obj:
        readcontent = obj.read()
    kk = [i if i != '\n' else " " for i in readcontent]
    kk = "".join(kk)
    kk = re.sub(r' ', " ", kk)
    kk = re.sub(r' ', " ", kk)
    kk = list(kk)
    # inpath = os.path.join(abspath, 'dataset', r"train_token_1000.txt")
    # with open(inpath, 'r', encoding='utf-8') as obj:
    # for i in obj.readlines():
    # kk.extend(i.strip().split(" "))

    while '□' in kk:
        kk.remove("□")
    unique = np.unique(kk)
    length = len(unique)
    id2char = {i:char for i, char in enumerate(unique)}
    char2id = {char:i for i, char in enumerate(unique)}
    if not os.path.exists(id2char_char2id):
        with open(id2char_char2id, 'w', encoding='utf-8') as obj:
            json.dump({"id2char":id2char, 'char2id':char2id}, obj, indent=2, separators=(",", ":"), ensure_ascii=False)
    else:
        with open(id2char_char2id, 'r', encoding='utf-8') as obj:
            jsonfile = json.load(obj)
        id2chark = jsonfile["id2char"]
        char2id = jsonfile["char2id"]
        length = len(id2chark)  # vocabulary size comes from the saved JSON mapping
        id2char = {}
        for key, value in id2chark.items():
            id2char[int(key)] = value
    return length, id2char, char2id, kk

def create_masks_future(inputs):
    #future
    n, sequence_length = inputs.shape
    input_mask = np.tril(np.ones((sequence_length, sequence_length)))
    input_mask[input_mask==0] = -np.inf
    # input_mask[input_mask==0] = -1e6
    input_mask[input_mask==1] = 0
    return input_mask
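# e.g. create_masks_future(np.zeros((2, 4))) returns a 4x4 matrix that is 0 on and
# below the diagonal and -inf above it, so softmax assigns no weight to future positions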

def create_masks_pad(input_mask):
    #pad
    input_mask = np.array(input_mask)
    n, sequence_length = input_mask.shape
    k1 = input_mask[:, None, :]
    k2 = np.ones_like(input_mask)[:, :, None]
    k = k1 * k2
    k = (1.0 - k)
    k[k==1.0] = -np.inf
    return k

# k = create_masks_pad([[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 0]])

def getinputs(context_length, batchsize, input_texts, char2id, id2char):
    inputs = []
    label = []
    input_mask = []
    id_start = np.random.randint(0, len(input_texts) - context_length -1, (batchsize))
    markedchar = ['，', '。']  # Chinese comma and full stop (unused below)
    for id in id_start:
        tmp = [char2id[ci] for ci in input_texts[id : id + context_length + 1]]
        # inputchar = "".join([id2char[ci] for ci in tmp])
        # input_mask.append([1 for ci in range(context_length-1)])
        # input_mask[-1].extend([0])
        inputs.append(tmp[:-1])
        label.append(tmp[1:])
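        # tmp[:-1] is tokens t..t+T-1 and tmp[1:] is tokens t+1..t+T,
        # so every position is trained to predict the next character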
    inputs = np.array(inputs)
    if len(input_mask)==0:
        input_mask = np.ones_like(inputs)
            
    input_mask_fut = create_masks_future(inputs)
    # input_mask_pad = create_masks_pad(input_mask)
    input_mask = input_mask_fut
    label_single = np.array(label) #.reshape(-1)
    
    return inputs, input_mask, label_single

def transformer_image_train():
    vocab_size, id2char, char2id, input_texts = getdata()

    all_steps = 3000 - 1000
    batchsize = 63 + 1
    learning_rate = 0.003 # batchsize
    embed_dim = 192 ## vocab_size if vocab_size%3==0 else (vocab_size//3) * 3 + 3 # 192
    num_layer = 10 + 1 + 1
    num_h = [3] * num_layer
    context_length = 100

    ADAM=True
    cls_token = True
    float32 = True

    logfile = os.path.join(logdir, 'log_gpt_poetry3000.txt')
    fpwrite = open(logfile, 'w', encoding='utf-8')

    patchemb = Position_Embedding(context_length, vocab_size, embed_dim, adam=ADAM)
    layers = [patchemb]
    
    at0 = attdecoderblock_layer(embed_dim, num_h[0], adam=ADAM, float32=float32)
    at1 = attdecoderblock_layer(embed_dim, num_h[1], adam=ADAM, float32=float32)
    at2 = attdecoderblock_layer(embed_dim, num_h[2], adam=ADAM, float32=float32)
    at3 = attdecoderblock_layer(embed_dim, num_h[3], adam=ADAM, float32=float32)
    at4 = attdecoderblock_layer(embed_dim, num_h[4], adam=ADAM, float32=float32)
    at5 = attdecoderblock_layer(embed_dim, num_h[5], adam=ADAM, float32=float32)
    at6 = attdecoderblock_layer(embed_dim, num_h[6], adam=ADAM, float32=float32)
    at7 = attdecoderblock_layer(embed_dim, num_h[7], adam=ADAM, float32=float32)
    at8 = attdecoderblock_layer(embed_dim, num_h[8], adam=ADAM, float32=float32)
    at9 = attdecoderblock_layer(embed_dim, num_h[9], adam=ADAM, float32=float32)
    at10 = attdecoderblock_layer(embed_dim, num_h[10], adam=ADAM, float32=float32)
    at11 = attdecoderblock_layer(embed_dim, num_h[11], adam=ADAM, float32=float32)
    # at12 = attdecoderblock_layer(embed_dim, num_h[12], adam=ADAM, float32=float32)
    # at13 = attdecoderblock_layer(embed_dim, num_h[13], adam=ADAM, float32=float32)

    # layers += [at0, at1, at2, at3, at4, at5, at6, at7, at8, at9, at10, at11, at12]
    layers += [at0, at1, at2, at3, at4, at5, at6, at7, at8, at9, at10, at11]
    # layers += [at0, at1, at2, at3, at4, at5, at6]

    norm = layer_norm(embed_dim, adam=ADAM)
    # if not cls_token:
    #     cll = classify_layer(embed_dim, batchsize, 1, vocab_size, cls_token, adam=ADAM, relu=False, float32=float32)
    # else:
    cll = fclayer(embed_dim, vocab_size, True, adam=ADAM, float32=float32)
    layers += [norm, cll]

    datapath = os.path.join(abspath, 'dataset')
    os.makedirs(datapath, exist_ok=True)
    modelpath = os.path.join(abspath, 'gpt', 'model')
    os.makedirs(modelpath, exist_ok=True)

    if os.path.exists(pretrained_model):
        with open(pretrained_model, 'rb') as obj:
            models = pickle.load(obj)
        cnt = 0
        for l in layers:
            k = dir(l)
            if 'restore_model' in k and 'save_model' in k:
                l.restore_model(models[cnt])
                cnt += 1
        del models

    alliter = 0
    lr = learning_rate
    start_epoch = 1
    try:
        if os.path.exists(pretrained_model):
            start_epoch = int(pretrained_model.split(os.sep)[-1].split("_")[3]) + 1
    except:
        start_epoch = 1
    while alliter < all_steps:
        meanloss = 0
        jk = 0
        pre_col = []
        while True:
            if alliter > all_steps:
                break
            if alliter <= 100:
                lr = learning_rate * alliter / 100
            if alliter==23*all_steps//30:
                lr = learning_rate * 0.1
            elif alliter==28*all_steps//30:
                lr = learning_rate * 0.1 * 0.1
            alliter += 1
            jk += 1
            inputs, input_mask, label_single = getinputs(context_length, batchsize, input_texts, char2id, id2char)

            for l in range(len(layers)):
                if isinstance(layers[l], attdecoderblock_layer):
                    inputs = layers[l].forward(inputs, input_mask)
                else:
                    inputs = layers[l].forward(inputs)

            ishape = inputs.shape
            inputs = np.reshape(inputs, (-1, vocab_size))
            labels = np.zeros_like(inputs)
            labels[np.arange(len(inputs)), label_single.reshape(-1)] = 1
            loss, delta, predict = cross_entropy_loss(inputs, labels)
            # loss = loss * batchsize
            # delta = delta * batchsize
            delta = np.reshape(delta, ishape)
            
            # delta = np.zeros_like(inputs)
            # loss = 0
            # predict = np.zeros_like(inputs[0])
            # for ik in range(batchsize):
            # labels = np.zeros_like(inputs[ik])
            # labels[np.arange(len(inputs[ik])), label_single[ik]] = 1
            # losskkk, deltakkk, predictkkk = cross_entropy_loss(inputs[ik], labels)
            # delta[ik, :, :] = deltakkk
            # loss += losskkk
            # predict = np.concatenate([predict, predictkkk], axis = 0)
            # predict = predict[32*16//2:, :]
            # delta *= batchsize
            # loss *= batchsize
            for l in range(len(layers)-1, -1, -1):
                delta = layers[l].backward(delta)
                layers[l].update(lr)
                layers[l].setzero()

            p = np.argmax(predict, axis=-1)
            precision = np.sum(label_single.reshape(-1)==p) / len(p)
            pre_col.append(precision)
            meanloss += loss
            i = alliter * (context_length + 1) // len(input_texts)
            if alliter % 100 == 0:  # periodically run one validation batch
                inputs, input_mask, label_single = getinputs(context_length, batchsize, input_texts, char2id, id2char)
                for l in range(len(layers)):
                    if isinstance(layers[l], attdecoderblock_layer):
                        inputs = layers[l].forward(inputs, input_mask)
                    else:
                        inputs = layers[l].forward(inputs)
                ishape = inputs.shape
                inputs = np.reshape(inputs, (-1, vocab_size))
                labels = np.zeros_like(inputs)
                labels[np.arange(len(inputs)), label_single.reshape(-1)] = 1
                # k = np.sum(labels, axis = -1)
                _, _, predict = cross_entropy_loss(inputs, labels)
                p = np.argmax(predict, axis=-1)
                valpre = np.sum(label_single.reshape(-1)==p) / len(p)
                output = ''.join([id2char[int(ij)] for ij in p[:(len(p)//batchsize)]]) + "\n"
            else:
                output = "\\
"
                valpre=0
        
            fpwrite.write("epoch:{}, lr: {:.6f}, loss: {:.6f}, iters: {}, precision: {:.6f}, valpre: {:.6f}\\
 {}". \
                    format(i, lr, loss, str(jk) + "_" + str(alliter), precision, valpre, output))
            fpwrite.flush()
            
            # savemodel
            if (alliter + 1) % 100==0:
                allmodel = []
                for l in layers:
                    k = dir(l)
                    if 'restore_model' in k and 'save_model' in k:
                        allmodel.append(l.save_model())
                name = f"gpt_poetry3000_iters{alliter}_" + str(i) + "_loss_" + str(round(meanloss, 6)) + ".pkl"

                with open(os.path.join(modelpath, name), 'wb') as obj:
                    pickle.dump(allmodel, obj)
        meanloss /= jk

        fpwrite.write("epoch: {}, {}\\
\\
".format(i, ''.join(output[:200])))
        fpwrite.flush()
    fpwrite.close()

if __name__ =="__main__":
    savepath = abspath
    pretrained_model = r''
    logdir = os.path.join(savepath, 'gpt', 'log')
    os.makedirs(logdir, exist_ok=True)
    transformer_image_train()

'''
https://github.com/google-research/vision_transformer/blob/main/vit_jax/models_vit.py
https://github.com/UdbhavPrasad072300/Transformer-Implementations/blob/main/notebooks/MNIST Classification - ViT.ipynb
https://github.com/s-chh/PyTorch-Vision-Transformer-ViT-MNIST/tree/main
https://itp.uni-frankfurt.de/~gros/StudentProjects/WS22_23_VisualTransformer/
A Naive Transformer Architecture for MNIST Classification Using PyTorch
https://medium.com/mlearning-ai/vision-transformers-from-scratch-pytorch-a-step-by-step-guide-96c3313c2e0c
https://github.com/BrianPulfer/PapersReimplementations/blob/main/vit/vit_torch.py
https://github.com/microsoft/Swin-Transformer
https://huggingface.co/docs/transformers/v4.27.0/model_doc/vit
'''

https://zhuanlan.zhihu.com/p/659018819