PyTorch: Optimizing Neural Networks

Here are 17 low-effort, high-impact ways to speed up training deep models with PyTorch. The methods in this article assume that you are training on a GPU.

01 Consider another learning rate schedule

The choice of learning rate schedule has a large influence on how fast a model converges and how well it generalizes. Leslie N. Smith et al. proposed the cyclical learning rate and the 1Cycle learning rate schedule in the papers “Cyclical Learning Rates for Training Neural Networks” and “Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates”; they were later popularized by Jeremy Howard and Sylvain Gugger of fast.ai.

Sylvain wrote: 1Cycle consists of two steps of equal length, one going from a lower learning rate to a higher one, and the other back down to the minimum. The maximum should be the value picked by a learning rate finder, and the lower one can be ten times smaller. The length of this cycle should be slightly less than the total number of epochs, and in the final stage of training the learning rate should be allowed to drop several orders of magnitude below the minimum. In the best case, this schedule achieves a huge speedup (what Smith calls super-convergence) compared to conventional learning rate schedules. For example, using the 1Cycle policy to train ResNet-56 on the ImageNet dataset, the number of training iterations drops to roughly one tenth of the original, yet the model still matches the performance reported in the original paper. This schedule seems to work well across common architectures and optimizers.

PyTorch already implements both of these schedulers: torch.optim.lr_scheduler.CyclicLR and torch.optim.lr_scheduler.OneCycleLR.
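As a rough illustration, here is how OneCycleLR is typically wired into a training loop. The toy model, data, and hyperparameter values below are placeholders chosen only for this sketch:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data purely for illustration
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(TensorDataset(torch.randn(256, 10),
                                        torch.randint(0, 2, (256,))),
                          batch_size=32)

num_epochs = 5
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                          # peak LR, e.g. from an LR-finder run
    steps_per_epoch=len(train_loader),
    epochs=num_epochs,
)

for epoch in range(num_epochs):
    for data, label in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), label)
        loss.backward()
        optimizer.step()
        scheduler.step()                 # OneCycleLR is stepped after every batch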

Reference documentation: https://pytorch.org/docs/stable/optim.html

02 Use multiple workers and page-locked memory in DataLoader

When using torch.utils.data.DataLoader, set num_workers > 0 instead of the default value of 0 and also set pin_memory=True instead of the default value of False.

Reference documentation: https://pytorch.org/docs/stable/data.html

Szymon Micacz, a senior CUDA deep learning algorithm software engineer at NVIDIA, reports a 2x speedup for a single epoch by using four workers and pinned (page-locked) memory. A common rule of thumb is to set the number of workers to four times the number of available GPUs; going significantly higher or lower than that tends to slow training down. Note that increasing num_workers also increases CPU memory consumption.
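A minimal sketch of such a DataLoader setup (the toy dataset and the worker count of 4 are illustrative, and it assumes a CUDA device is available):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset purely for illustration
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # > 0: data is loaded asynchronously in worker processes
    pin_memory=True,   # page-locked host memory speeds up host-to-GPU copies
)

for data, label in loader:
    # non_blocking=True pairs well with pin_memory=True
    data = data.to("cuda", non_blocking=True)
    label = label.to("cuda", non_blocking=True)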

03 Adjust the batch size to the maximum

Pushing the batch size to the maximum is a somewhat controversial tip. In general, training is faster if you use the largest batch size your GPU memory allows. However, you then also have to adjust other hyperparameters, such as the learning rate. A good rule of thumb is to double the learning rate when the batch size doubles.

OpenAI’s paper “An Empirical Model of Large-Batch Training” has a good demonstration of how many steps are needed for different batch sizes to converge. In the article “How to get 4x speedup and better generalization using the right batch size”, the author Daniel Huynh did some experiments with different batch sizes (also using the 1Cycle strategy discussed above).

Ultimately, he increased the batch size from 64 to 512 and achieved a 4x speedup. The downside is that large batches can lead to solutions that generalize worse than those found with small batches.
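As a tiny illustration of the rule of thumb mentioned above (all numbers here are made up for the example):

# Linear-scaling rule of thumb: when the batch size doubles, double the learning rate.
base_batch_size, base_lr = 64, 0.01
batch_size = 512
lr = base_lr * (batch_size / base_batch_size)   # 0.08 for an 8x larger batch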

04 Use Automatic Mixed Precision (AMP)

PyTorch 1.6 includes a native implementation of automatic mixed-precision training. The point is that some operations run faster in half precision (FP16) than in single precision (FP32) without losing accuracy. AMP automatically decides which operation should run at which precision. This both speeds up training and reduces memory usage.

In the best case, AMP is used like this:

import torch
# Creates once at the beginning of training
scaler = torch.cuda.amp.GradScaler()


for data, label in data_iter:
    optimizer.zero_grad()
    # Casts operations to mixed precision
    with torch.cuda.amp.autocast():
        loss = model(data)

    # Scales the loss, and calls backward()
    # to create scaled gradients
    scaler.scale(loss).backward()

    # Unscales gradients and calls
    # or skips optimizer.step()
    scaler.step(optimizer)

    # Updates the scale for next iteration
    scaler.update()

05 Consider using another optimizer

AdamW is Adam with decoupled weight decay (rather than L2 regularization), popularized by fast.ai and implemented in PyTorch as torch.optim.AdamW. AdamW seems to consistently outperform Adam in both error achieved and training time. Both Adam and AdamW work well with the 1Cycle strategy described above.

There are also a few optimizers that are not yet native to PyTorch but are getting a lot of attention, most notably LARS and LAMB. NVIDIA’s APEX implements fused versions of some common optimizers, such as Adam. Compared to the Adam implementation in PyTorch, the fused version avoids multiple passes to and from GPU memory and is around 5% faster.
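A minimal sketch of switching to AdamW (the toy model and the hyperparameter values are illustrative):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # toy model for illustration

# AdamW applies weight decay directly to the weights instead of adding an
# L2 term to the loss; lr and weight_decay values are illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)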

06 cuDNN benchmark

If your model architecture and input size stay the same, set torch.backends.cudnn.benchmark = True so that cuDNN benchmarks its algorithms once and then reuses the fastest one.

07 Be careful with frequent data transfers between CPU and GPU

Frequently calling tensor.cpu() to move tensors from GPU to CPU (or tensor.cuda() to move them from CPU to GPU) can be very expensive. The same goes for .item() and .numpy(); use .detach() instead.

When creating a new tensor, you can place it on the GPU directly with the keyword argument device=torch.device('cuda:0').

If you need to transfer data, you can use .to(non_blocking=True), as long as there is no synchronization point after the transfer.
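A short sketch of these ideas, assuming a CUDA device is available (the tensor shapes are arbitrary):

import torch

device = torch.device("cuda:0")

x = torch.randn(64, 128, device=device)   # created directly on the GPU, no copy needed
y = torch.randn(64, 128).pin_memory()     # page-locked host tensor
y = y.to(device, non_blocking=True)       # asynchronous host-to-GPU copy

loss = (x * y).sum()
value = loss.item()   # forces a GPU sync and a device-to-host copy; do this sparingly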

08 Use gradient/activation checkpointing

Checkpointing works by trading compute for memory. Rather than storing all of the intermediate activations of the computation graph for the backward pass, it recomputes them. It can be applied to any part of the model.

Specifically, in the forward pass the function is run under torch.no_grad(), so no intermediate activations are stored; instead, the input tuple and the function itself are saved. In the backward pass, the inputs and the function are retrieved and the forward pass is computed again, this time tracking the intermediate activations, which are then used to compute the gradients.

So while this may slightly increase the runtime for a given batch size, it significantly reduces the memory footprint. This in turn allows you to further increase the batch size and thus make better use of the GPU.

Although checkpointing is implemented via torch.utils.checkpoint, it still takes some thought and effort to get it right. Priya Goyal wrote a good tutorial covering the key aspects of checkpointing.

Priya Goyal tutorial address: https://github.com/prigoyal/pytorch_memonger/blob/master/tutorial/Checkpointing_for_PyTorch_models.ipynb
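For a flavor of the API, here is a minimal sketch using checkpoint_sequential on a toy sequential model (the depth, layer sizes, and segment count are illustrative; see Priya Goyal’s tutorial for real-world usage):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy sequential model; layer sizes and depth are arbitrary
layers = []
for _ in range(10):
    layers += [nn.Linear(100, 100), nn.ReLU()]
model = nn.Sequential(*layers)

inputs = torch.randn(32, 100, requires_grad=True)

# Split the model into 2 segments: only the activations at segment boundaries
# are kept; everything in between is recomputed during the backward pass.
out = checkpoint_sequential(model, 2, inputs)
out.sum().backward()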

09 Use gradient accumulation

Another way to increase the effective batch size is to accumulate gradients over multiple .backward() passes before calling optimizer.step().

Thomas Wolf of Hugging Face describes how to use gradient accumulation in the article “Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups”. Gradient accumulation can be implemented as follows:

model.zero_grad() # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs) # Forward pass
    loss = loss_function(predictions, labels) # Compute loss function
    loss = loss / accumulation_steps # Normalize our loss (if averaged)
    loss.backward() # Backward pass
    if (i + 1) % accumulation_steps == 0: # Wait for several backward steps
        optimizer.step() # Now we can do an optimizer step
        model.zero_grad() # Reset gradients tensors
        if (i + 1) % evaluation_steps == 0: # Evaluate the model when we...
            evaluate_model() # ...have no gradients accumulated

This method was mainly developed to circumvent the limitations of GPU memory.

10 Use distributed data parallelism for multi-GPU training

There are probably many ways to speed up distributed training, but a simple one is to use torch.nn.parallel.DistributedDataParallel instead of torch.nn.DataParallel. This way, each GPU is driven by a dedicated CPU core, avoiding DataParallel’s GIL issues.

Distributed training document address: https://pytorch.org/tutorials/beginner/dist_overview.html
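A minimal sketch of a DistributedDataParallel setup, assuming the script is launched with torchrun (one process per GPU); the toy model and the omitted data loading are placeholders:

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 2).cuda(local_rank)      # toy model for illustration
    model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced across processes

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # ... build a DataLoader with a DistributedSampler and run the usual training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()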

11 Set gradients to None instead of 0

Call .zero_grad(set_to_none=True) instead of .zero_grad(). This lets the memory allocator handle the gradients rather than actively setting them to 0. As the documentation says, setting gradients to None gives a modest speedup, so don’t expect miracles. Note that doing this also has drawbacks; see the documentation for details.

Document address: https://pytorch.org/docs/stable/optim.html

12 Use .as_tensor() instead of .tensor()

torch.tensor() always makes a copy of the data. If you are converting a numpy array, use torch.as_tensor() or torch.from_numpy() to avoid copying the data.
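A small example of the difference (the array size is arbitrary):

import numpy as np
import torch

arr = np.ones((3, 3), dtype=np.float32)

t_copy = torch.tensor(arr)       # always copies the data
t_share = torch.as_tensor(arr)   # reuses the numpy buffer when dtype/device allow
t_numpy = torch.from_numpy(arr)  # always shares memory with the numpy array

arr[0, 0] = 5.0
print(t_copy[0, 0].item())       # 1.0 - independent copy
print(t_share[0, 0].item())      # 5.0 - shares memory with arr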

13 Turn on debugging tools if necessary

PyTorch provides many debugging tools, such as autograd.profiler, autograd.gradcheck, and autograd.detect_anomaly. Make sure to turn them on only when you actually need to debug, and turn them off otherwise, because they slow down training.
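For example, anomaly detection can be scoped with a context manager so that it only affects the code you are debugging (the toy computation is illustrative):

import torch

# Enable anomaly detection only while debugging; it slows down every backward pass.
with torch.autograd.set_detect_anomaly(True):
    x = torch.randn(4, requires_grad=True)
    y = (x * 2).sum()
    y.backward()   # a NaN/Inf produced here would raise an error with a traceback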

14 Use gradient clipping

Originally used to avoid exploding gradients in RNNs, gradient clipping (gradient = min(gradient, threshold)) has both experimental and theoretical support as a way to speed up convergence. Hugging Face’s Transformers implementation is a very clear example of how to use gradient clipping, and it can be combined with some of the other methods mentioned in this article, such as AMP.

In PyTorch this can be achieved using torch.nn.utils.clip_grad_norm_.
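A minimal sketch of clipping gradients by global norm inside a training step (the toy model, data, and max_norm value are illustrative):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

data, target = torch.randn(32, 10), torch.randn(32, 2)

optimizer.zero_grad()
loss = criterion(model(data), target)
loss.backward()
# Rescale gradients so their global norm is at most max_norm (1.0 is illustrative)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()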

15 Turn off bias in layers before BatchNorm

Turn off the bias in layers that come directly before a BatchNormalization layer, since BatchNorm’s mean subtraction makes the bias redundant. For a 2-D convolutional layer, set the bias keyword to False: torch.nn.Conv2d(..., bias=False, ...).
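A small example of such a block (the channel counts are arbitrary):

import torch.nn as nn

# The BatchNorm layer's own affine shift replaces the conv bias,
# so bias=False saves parameters without changing what the block can express.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)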

16 Turn off gradient calculation during validation

To turn off gradient computation during validation, wrap the validation loop in with torch.no_grad().
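A minimal sketch, where model and val_loader are placeholders for your own network and validation DataLoader:

import torch

# `model` and `val_loader` are assumed to already exist
model.eval()               # switch dropout/batchnorm to inference behavior
with torch.no_grad():      # no autograd graph is built, saving memory and time
    for data, label in val_loader:
        output = model(data)
        # ... accumulate validation metrics here ...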

17 Use input and batch normalization

Double-check whether your inputs are normalized and whether you are using batch normalization.
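A small sketch of both checks, assuming torchvision is available for the input transform (the mean/std values are the commonly used ImageNet statistics, and the layer sizes are arbitrary):

import torch.nn as nn
from torchvision import transforms

# Input normalization: per-channel mean/std (ImageNet statistics shown here)
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Batch normalization inside the network
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)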

Original link: https://efficientdl.com/faster-deep-learning-in-pytorch-a-guide/