[Solved] A solution to CUDA out of memory for T5-large model

Conclusion first: switch the optimizer to Adafactor.

Recently I used Google's T5 model for some semantic recovery experiments, referencing some code on GitHub. At first I trained the T5-base model on the server, using a 30-series graphics card with 8192 MiB of video memory and batch_size 32. The video memory was nearly full during training, but it ran without problems and the results were decent.
Then I saw that some papers used the T5-large model and got better results than with T5-base — after all, it has more parameters. So I figured I would just change 't5-base' to 't5-large' when loading the model; worst case I would lower the batch_size a little, and surely it would still run?

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('t5-large')
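
The jump in memory comes mostly from the parameter count: t5-base has roughly 220M parameters, while t5-large has roughly 770M. A quick way to check for yourself (a minimal sketch, not from the original code; the model names are the standard Hugging Face hub IDs):

from transformers import T5ForConditionalGeneration

# Load each checkpoint and count its parameters.
for name in ('t5-base', 't5-large'):
    m = T5ForConditionalGeneration.from_pretrained(name)
    n_params = sum(p.numel() for p in m.parameters())
    print(f'{name}: {n_params / 1e6:.0f}M parameters')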

Then I found that even batch_size = 1 did not work — still OOM. I went looking on Hugging Face for tips on training T5, and one suggestion was to switch the optimizer from AdamW to Adafactor in order to train T5-large. The purpose of Adafactor is to provide a low-cost, memory-efficient alternative to the usual adaptive optimization methods for models with a huge number of parameters. A reference configuration is given below.

from transformers.optimization import Adafactor

# Settings for use with a fixed external learning rate:
# relative_step, scale_parameter and warmup_init are all turned off.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False
)
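
For completeness, here is a minimal sketch of how this optimizer slots into an ordinary training step. The dataloader, batch layout and device handling are my own placeholders, not from the original code:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.optimization import Adafactor

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = T5Tokenizer.from_pretrained('t5-large')
model = T5ForConditionalGeneration.from_pretrained('t5-large').to(device)

optimizer = Adafactor(model.parameters(), lr=1e-3, eps=(1e-30, 1e-3),
                      clip_threshold=1.0, decay_rate=-0.8, beta1=None,
                      weight_decay=0.0, relative_step=False,
                      scale_parameter=False, warmup_init=False)

model.train()
for src_texts, tgt_texts in train_loader:  # train_loader is a hypothetical DataLoader of text pairs
    inputs = tokenizer(src_texts, return_tensors='pt', padding=True).to(device)
    labels = tokenizer(tgt_texts, return_tensors='pt', padding=True).input_ids.to(device)
    # T5 computes the loss internally when labels are provided.
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()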

In the end it could only handle a small amount of data at a time, but at least it ran, and I could compare the difference between base and large. So finetuning T5-large still needs powerful hardware support.