From single card to multiple cards: how to use DDP, with code (1)

Distributed training is a common way to accelerate training with multiple cards. Generally there are two options: DataParallel (DP) and DistributedDataParallel (DDP). This article covers DDP, the more commonly used of the two, and shows how to modify a single-card program so that it can run on either a single card or multiple cards.
The single-card program is as follows:

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc = nn.Linear(28 * 28, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = self.fc(x)
        return x

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transform = transforms.ToTensor()
train_dataset = datasets.MNIST(root='./', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Adding the test dataset and loader
test_dataset = datasets.MNIST(root='./', train=False, download=True, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

model = SimpleNN().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Initializing the loss function outside of the loops
criterion = nn.CrossEntropyLoss()

#Training
for epoch in range(5):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target) # using the initialized loss function
        loss.backward()
        optimizer.step()

#Testing
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        test_loss += criterion(output, target).item() # using the initialized loss function
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()

test_loss /= len(test_loader.dataset)
print(f'\nTest set: Average loss: {test_loss:.4f}, '
      f'Accuracy: {correct}/{len(test_loader.dataset)} '
      f'({100. * correct / len(test_loader.dataset):.0f}%)\n')

As you can see, this is a simple classification task. Before showing the multi-card code, let's go over a few concepts.
The way I think about DDP: every card runs the same training code, but in its own process with its own process number. Taking a single machine with four cards as an example, the process numbers are 0, 1, 2, and 3.
DDP supports two modes: single-machine multi-card (the common case) and multi-machine multi-card. Only single-machine multi-card is covered here.
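To make the "same code, different process number" idea concrete, here is a minimal sketch (the file name minimal_rank_demo.py and the gloo backend are my own choices; gloo lets it run even without GPUs). Every launched process runs the same file but reports a different rank:

import os
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT
# for each process before the script starts
dist.init_process_group(backend='gloo')
print(f"I am process {dist.get_rank()} of {dist.get_world_size()}, "
      f"local_rank = {os.environ.get('LOCAL_RANK')}")
dist.destroy_process_group()

Launching it with torchrun --nproc_per_node=4 minimal_rank_demo.py prints four lines, one per process, with ranks 0 to 3.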
Usually we need to import two packages

from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist

Set up the process group

Normally a single default process group is enough:

dist.init_process_group(backend='nccl')
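By default, init_process_group uses the env:// initialization method, i.e. it reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment, and the launcher exports these for every process. A sketch of what that amounts to if you were to set them by hand for a single process (the address and port below are only illustrative):

import os
import torch.distributed as dist

# Normally exported by the launcher; set here only to illustrate env:// initialization
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')

dist.init_process_group(backend='nccl')  # joins the default group using the variables above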

world_size

The total number of processes across all machines:

torch.distributed.get_world_size()

rank

The global index of the current process, used for inter-process communication. For two eight-card servers the world size is 16 and the ranks are 0, 1, 2, ..., 15.
Note: the process with rank 0 is the master process.

# Get the rank. Each process has its own, distinct index.
torch.distributed.get_rank()
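A typical use of rank (together with world_size) is to restrict logging, evaluation, and checkpointing to the master process, which is exactly the pattern the full example below uses:

import torch.distributed as dist

if dist.get_rank() == 0:
    # only the master process prints / evaluates / saves checkpoints
    print(f"training with {dist.get_world_size()} processes")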

local_rank

The index of the process on its own machine: machine one has local ranks 0, 1, ..., 7, and machine two also has local ranks 0, 1, ..., 7. There is no torch.distributed.local_rank() function; the launcher hands local_rank to each process, either as a --local_rank command-line argument (torch.distributed.launch) or through the LOCAL_RANK environment variable (torchrun):

import os
local_rank = int(os.environ.get('LOCAL_RANK', -1))  # -1 means the script was not started by a DDP launcher

For single-machine multi-card training there is no difference between rank and local_rank, because there is only one machine.
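For completeness, in the multi-machine case (the two eight-card servers mentioned above) the three numbers are related as follows; this assumes the contiguous rank assignment used by the PyTorch launchers, and the numbers are hypothetical, not from the run in this article:

num_nodes, nproc_per_node = 2, 8
world_size = num_nodes * nproc_per_node                   # 16 processes in total
for node_rank in range(num_nodes):
    for local_rank in range(nproc_per_node):
        rank = node_rank * nproc_per_node + local_rank
        print(f"machine {node_rank}: local_rank {local_rank} -> global rank {rank}")  # ranks 0..15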

The multi-card code is below:

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import argparse

# Packages needed for ddp
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist

# Set local_rank with a default of -1. When the script is launched with torch.distributed.launch,
# each card's process is automatically given its own local_rank, starting from 0. Remember that
# local_rank is the only difference between the copies of this program running on each card; we use
# it to let the main process do the testing and save the model.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)
args = parser.parse_args()

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc = nn.Linear(28 * 28, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = self.fc(x)
        return x

# Initialize the ddp environment and allocate cards to each process
if args.local_rank >= 0:
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend='nccl')
    ddp=True
else:
    ddp=False
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()
transform = transforms.ToTensor()
train_dataset = datasets.MNIST(root='./', train=True, download=False, transform=transform)
test_dataset = datasets.MNIST('./', train=False, download=False, transform=transform)
# train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# Note: multi-card training needs a DistributedSampler so that each card is assigned different batches
if ddp:
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = DataLoader(train_dataset, batch_size=32, sampler=train_sampler)
else:
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# Testing only needs to run on one card, and that card evaluates the full test set
test_loader = DataLoader(test_dataset, batch_size=1000)

model = SimpleNN().to(device)

# Wrap the model with DDP and scale up the learning rate. Parameters are updated by
# learning_rate * gradient, every card holds identical model parameters, and with multiple cards the
# effective batch size grows by a factor of the number of GPUs. So if the single-card learning rate
# was 0.01, the multi-card run uses 0.01 * gpu_num (the linear scaling rule).
if ddp:
    model = DDP(model, device_ids=[args.local_rank])
    gpu_num = torch.distributed.get_world_size()
else:
    gpu_num = 1


optimizer = optim.SGD(model.parameters(), lr=0.01*gpu_num)

#Training
for epoch in range(5):
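    # Recommended when using a DistributedSampler: re-seed the shuffle each epoch so that
    # every epoch sees a different ordering across the cards
    if ddp:
        train_sampler.set_epoch(epoch)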
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    # Testing is done only by the main process
    if not ddp or (ddp and dist.get_rank() == 0):
        model.eval()
        test_loss = 0
        correct = 0
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.cuda(), target.cuda()
                output = model(data)
                pred = output.argmax(dim=1, keepdim=True)
                correct += pred.eq(target.view_as(pred)).sum().item()

        print(
            f"\nTest set: Accuracy: {correct}/{len(test_loader.dataset)} "
            f"({100. * correct / len(test_loader.dataset)}%)\n")


# With ddp, only the main process saves the model; the other processes do not need to save it again.

if not ddp or (ddp and dist.get_rank() == 0):
    torch.save(model.state_dict(), 'ddp_model.pth')
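One more detail: after the model is wrapped in DDP, the keys in model.state_dict() carry a "module." prefix. If you want a checkpoint that loads directly into a plain SimpleNN, a common pattern (a sketch, with an arbitrary file name) is to save the inner module instead:

if not ddp or dist.get_rank() == 0:
    state_dict = model.module.state_dict() if ddp else model.state_dict()
    torch.save(state_dict, 'ddp_model_plain.pth')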

Finally, start DDP with the following command:

python -m torch.distributed.launch --nproc_per_node=2 ddp_code.py
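Newer PyTorch versions deprecate torch.distributed.launch in favor of torchrun, which passes local_rank through the LOCAL_RANK environment variable instead of the --local_rank argument. The equivalent launch would be the command below, assuming the script is adapted to fall back to the LOCAL_RANK environment variable when --local_rank is not supplied:

torchrun --nproc_per_node=2 ddp_code.py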

The running results are shown below.
Single card: [screenshot of single-card results]
Multi-card: [screenshot of multi-card results]
As you can see, the accuracy is essentially the same.