Using ChaosBlade to verify the stability of DLRover's elastic fault tolerance

Text | Wang Qinlong (alias: Chang Fan)

Ant Group AI Systems Engineer

ChaosBlade is an open source chaos-experiment injection tool developed by Alibaba. It follows chaos engineering principles and the chaos experiment model, and can be used to verify the stability of cloud-native systems. DLRover, as a cloud-native distributed training system, provides elasticity and fault tolerance to improve the stability of distributed training. In this article, we use ChaosBlade to create various chaos experiments that verify the stability of DLRover's elastic fault tolerance.

Prerequisite: Create a k8s cluster and deploy DLRover ElasticJob

  • Create a k8s cluster that can be accessed locally through kubectl. In the following experiments we use an Alibaba Cloud ACK cluster.
  • Deploy DLRover ElasticJob on the k8s cluster; please refer to the documentation.
  • Build a training image with chaosblade installed in it; please refer to the Dockerfile (a sketch of the install steps is shown after this list). We also provide a ready-made image for training MNIST, registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:torch201-mnist, which is used in the following experiments.
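chaosblade only needs to be unpacked into the image so that the blade binary is available during the experiments. A minimal sketch of the install steps such a Dockerfile might run is shown below; the release URL and version are assumptions based on the chaosblade-1.7.2 path used later in this article.

# Sketch: install chaosblade into the training image (download URL and version assumed).
curl -LO https://github.com/chaosblade-io/chaosblade/releases/download/v1.7.2/chaosblade-1.7.2-linux-amd64.tar.gz
tar -zxf chaosblade-1.7.2-linux-amd64.tar.gz   # produces ./chaosblade-1.7.2/blade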

Elastic fault tolerance of PyTorch distributed training

We conduct experiments that simulate the following scenarios to verify the elastic fault tolerance of DLRover distributed training:

  • A training Pod is preempted or evicted.
  • A training node is a slow node (straggler).
  • A training Pod is scheduled onto a faulty machine.
  • A training node's network fails during training.
  • The training process crashes inside a training Pod.
  • Automatic scale-up and scale-down of training.

Training Pod was preempted

In this experiment, we use the MNIST example to verify that DLRover can recover a preempted Pod. We replace the command in the job YAML with the following:

 command:
    - /bin/bash
    - -c
    - "dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
        --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
        examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
        --training_data /data/mnist_png/training/ \
        --validation_data /data/mnist_png/testing/"

Submit a 4-node job

 kubectl -n dlrover apply -f examples/pytorch/mnist/chaos_test_job.yaml
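Throughout these experiments it is convenient to watch the job's Pods from a second terminal; plain kubectl is enough for this:

# Watch the Pods of the job change in real time.
kubectl -n dlrover get pods -w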

After submitting the job, we can see the following Pods through kubectl -n dlrover get pods:

chaos-test-edljob-worker-0 1/1 Running 0 85s
chaos-test-edljob-worker-1 1/1 Running 0 85s
chaos-test-edljob-worker-2 1/1 Running 0 85s
chaos-test-edljob-worker-3 1/1 Running 0 85s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 89s

We manually delete a Pod to simulate the Pod being preempted.

kubectl -n dlrover delete pod chaos-test-edljob-worker-0
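If you want to repeat the preemption experiment several times, a small helper script like the following can be used (a hypothetical helper, not part of the DLRover examples; it assumes shuf is available locally):

# Hypothetical helper: "preempt" a random worker three times and observe recovery.
for i in 1 2 3; do
  victim=$(kubectl -n dlrover get pods --no-headers \
    | awk '/chaos-test-edljob-worker/ {print $1}' | shuf -n 1)
  echo "Deleting ${victim}"
  kubectl -n dlrover delete pod "${victim}"
  sleep 90   # give the DLRover master time to create a replacement Pod
  kubectl -n dlrover get pods
done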

We see that worker-0 is deleted and a new worker-4 is started to take its place.

chaos-test-edljob-worker-1 1/1 Running 0 2m3s
chaos-test-edljob-worker-2 1/1 Running 0 2m3s
chaos-test-edljob-worker-3 1/1 Running 0 2m3s
chaos-test-edljob-worker-4 1/1 Running 0 30s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 2m7s

Through the logs of worker-1, we can see that training continues

kubectl -n dlrover logs chaos-test-edljob-worker-1

>>>
loss = 2.298487901687622, step = 0
loss = 2.195965051651001, step = 20
loss = 1.2307546138763428, step = 40
loss = 0.6579511761665344, step = 60
loss = 1.0608341693878174, step = 80
loss = 0.7761049270629883, step = 100

The training node is a slow node

To simulate a slow training node, we use chaosblade to raise the CPU load of one Pod to 90%:

blade create cpu load --cpu-percent 90

We replace the command in the MNIST example YAML with the following, where start_chaos.sh cpu-overload raises the CPU load of worker-1 to 90% and turns it into a slow node.

 command:
    - /bin/bash
    - -c
    - "(bash examples/pytorch/mnist/start_chaos.sh cpu-overload & amp;) & amp; & amp; \
        dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
        --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
        examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
        --training_data /data/mnist_png/training/ \
        --validation_data /data/mnist_png/testing/"
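The start_chaos.sh script itself is not listed in this article. As a rough idea of how such a script might work, the sketch below dispatches on its first argument and injects the corresponding fault; this is only an assumption, and the real script in the DLRover repository also decides on which worker the fault is injected.

#!/bin/bash
# Sketch of start_chaos.sh (assumed, not the real script): inject a fault
# according to the first argument.
case "$1" in
  cpu-overload)
    # Raise the CPU load to 90% to turn this node into a straggler.
    ./chaosblade-1.7.2/blade create cpu load --cpu-percent 90
    ;;
  kill-process)
    # Keep killing the subprocesses spawned by dlrover-run (e.g. the
    # network-check task) to simulate a machine on which processes cannot start.
    while true; do
      pkill -9 -P "$(pgrep -f dlrover-run | head -n 1)" 2>/dev/null
      sleep 5
    done
    ;;
  no-chaos)
    # Inject nothing: run a normal training job.
    ;;
esac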

Then submit the job with kubectl -n dlrover apply -f examples/pytorch/mnist/chaos_test_job.yaml. The Pods are as follows:

elasticjob-torch-mnist-debug-dlrover-master 0/1 Completed 0 3h17m
torch-mnist-debug-edljob-worker-0 0/1 Completed 0 3h17m
torch-mnist-debug-edljob-worker-1 0/1 Error 0 3h17m
torch-mnist-debug-edljob-worker-2 0/1 Completed 0 3h17m
torch-mnist-debug-edljob-worker-3 0/1 Completed 0 3h17m
torch-mnist-debug-edljob-worker-4 0/1 Completed 0 3h10m

As we can see, worker-1 exited with an error. Its log shows that it exited because it was judged to be a slow node (straggler).

[2023-09-26 03:52:20,235] [INFO] [training.py:707:run] Fault nodes are: [] and stragglers are: [1].
Traceback (most recent call last):
  File "/usr/local/bin/dlrover-run", line 8, in <module>
    sys.exit(main())
  ...
  File "/usr/local/lib/python3.8/site-packages/dlrover/python/elastic_agent/torch/training.py", line 733, in run
    raise RuntimeError("The node is a straggler and exits.")
RuntimeError: The node is a straggler and exits.

This is because the job enables node checking and automatic exit for stragglers via dlrover-run --network-check --exclude-straggler. If you do not want a slow node to exit with an error, remove --exclude-straggler. During the node check, dlrover-run has each node run a simple matrix multiplication and allgather task and measures the elapsed time.

We can also check how long each node took to run the network-check task in the log of the job master Pod elasticjob-torch-mnist-debug-dlrover-master:

kubectl -n dlrover logs elasticjob-torch-mnist-debug-dlrover-master | grep elapsed

>>>
Round 0: The node elapsed time is {2: 20.307, 3: 20.265, 0: 206.872, 1: 151.752}
Round 1: The node elapsed time are {2: 20.307, 3: 20.265, 0: 23.174, 1: 135.961}
Round 2: The node elapsed time are {2: 21.491, 0: 22.685, 3: 20.889, 1: 23.097}

Each DLRover network check consists of two rounds. In the first check (rounds 0 and 1), worker-1 takes much longer than the other nodes. After worker-1 exits with an error, DLRover starts worker-4 to replace it. Worker-4 is a normal node whose elapsed time in the check is close to that of the other nodes, so the slow node no longer affects training.

The training Pod is scheduled on the faulty machine

If a training Pod is scheduled onto a faulty machine, for example one whose GPU card has failed, the training process cannot start. To simulate a faulty machine, we use chaosblade to kill the child process started by PyTorch so that it exits with an error. We replace the command in the MNIST example YAML with the following command.

 command:
    - /bin/bash
    - -c
    - "(bash examples/pytorch/mnist/start_chaos.sh kill-process & amp;) & amp; & amp; \
        dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
        --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
        examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
        --training_data /data/mnist_png/training/ \
        --validation_data /data/mnist_png/testing/"

start_chaos.sh kill-process kills the network-check subprocess started by dlrover-run in worker-1. This simulates worker-1 being a faulty machine, i.e. a machine on which the GPU processes cannot start normally. After submitting the job, we can see that worker-1 exited with an error and DLRover started worker-4 on a normal node to replace it.

chaos-test-edljob-worker-0 1/1 Running 0 12m
chaos-test-edljob-worker-1 0/1 Error 0 12m
chaos-test-edljob-worker-2 1/1 Running 0 12m
chaos-test-edljob-worker-3 1/1 Running 0 12m
chaos-test-edljob-worker-4 1/1 Running 0 3m59s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 12m

Looking at the log of worker-1, we can see that it exited because its node was judged to be faulty.

Traceback (most recent call last):
  ....
  File "/usr/local/lib/python3.8/site-packages/dlrover/python/elastic_agent/torch/training.py", line 732, in run
    raise RuntimeError("The node network is breakdown.")
RuntimeError: The node network is breakdown.

We can also view each node's network check results in the log of the job master Pod elasticjob-torch-mnist-debug-dlrover-master. Each check consists of two rounds. Worker-1 failed both rounds of the first check, so it exited with an error. The newly started worker-4 is not a faulty node, so the check passes and model training starts.

Round 1: The node status are {1: False, 2: True, 3: True, 0: False}.
Round 2: The node status are {1: False, 2: True, 3: True, 0: True}.
Round 3: The node status are {3: True, 0: True, 1: True, 2: True}.

Pod network failure during training

In this experiment, we first start a normal training job by replacing the command in the MNIST example YAML with the following command.

 command:
    - /bin/bash
    - -c
    - "(bash examples/pytorch/mnist/start_chaos.sh no-chaos & amp;) & amp; & amp; \
        dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
        --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
        examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
        --training_data /data/mnist_png/training/ \
        --validation_data /data/mnist_png/testing/"

After loss values start to appear in worker-1's log, we exec into worker-1 and use chaosblade to set its network packet loss rate to 100%, causing a network failure.

kubectl -n dlrover exec -it chaos-test-edljob-worker-1 bash
./chaosblade-1.7.2/blade create network loss --percent 100 --interface eth0
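In this experiment we simply let worker-1 fail, but note that chaosblade experiments can also be listed and reverted; the UID used below is printed by the corresponding blade create command:

# List active chaos experiments and revert one by its UID.
./chaosblade-1.7.2/blade status --type create
./chaosblade-1.7.2/blade destroy <UID>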

Then worker-1 exits with an error and worker-4 starts.

chaos-test-edljob-worker-0 1/1 Running 0 4m39s
chaos-test-edljob-worker-1 0/1 Error 0 4m39s
chaos-test-edljob-worker-2 1/1 Running 0 4m39s
chaos-test-edljob-worker-3 1/1 Running 0 4m39s
chaos-test-edljob-worker-4 1/1 Running 0 17s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 4m43s

Then, from worker-0's logs, we can see that training has recovered and continues:

loss = 0.24101698398590088, step = 0
loss = 0.4646361768245697, step = 20

Training process crashed

In this experiment, we first start a normal training job and then kill one training process with kill -9 to observe whether training resumes. We replace the first line of the command in the MNIST example YAML with (bash examples/pytorch/mnist/start_chaos.sh no-chaos &) && and then start the job.

kubectl -n dlrover exec -it chaos-test-edljob-worker-1 bash
ps -aux | grep cnn_train.py
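To terminate one of the training workers found above in a single step, a one-liner like the following can be run inside the Pod (it assumes pgrep and xargs exist in the image; chaosblade's process-kill experiment could achieve the same effect):

# Kill one training process, equivalent to kill -9 ${PID}.
pgrep -f cnn_train.py | head -n 1 | xargs -r kill -9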

After loss values appear in worker-1's log, we exec into the Pod and find the training processes with the commands above, then terminate one of them with kill -9 ${PID}. Checking the Pod status shows that the Pod did not exit with an error, which means training is continuing: dlrover-run restarts the killed subprocess inside the Pod.

chaos-test-edljob-worker-0 1/1 Running 0 3m4s
chaos-test-edljob-worker-1 1/1 Running 0 3m4s
chaos-test-edljob-worker-2 1/1 Running 0 3m4s
chaos-test-edljob-worker-3 1/1 Running 0 3m4s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 3m9s

Automatic scale-up and scale-down of training

In this experiment, we use the MNIST example to submit a job while the cluster does not have enough resources for all 4 nodes. As you can see, only 3 worker Pods are running; the fourth is pending due to insufficient resources.

elasticjob-torch-mnist-dlrover-master 1/1 Running 0 57s
torch-mnist-edljob-worker-0 1/1 Running 0 47s
torch-mnist-edljob-worker-1 1/1 Running 0 47s
torch-mnist-edljob-worker-2 1/1 Running 0 47s
torch-mnist-edljob-worker-3 0/1 Pending 0 47s
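To confirm why worker-3 is Pending, its scheduling events can be inspected with standard kubectl; the Events section typically reports insufficient CPU or memory:

# Show the scheduling events of the pending worker.
kubectl -n dlrover describe pod torch-mnist-edljob-worker-3 | grep -A 5 Events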

This is because in this job we set --nnodes=3:$WORKER_NUM, where WORKER_NUM is an environment variable that DLRover automatically sets in the Pod; its value equals the number of replicas, which is 4 in this experiment. About 2 minutes later, the log of worker-0 shows that the three running nodes have started training: the rendezvous log group_world_size=3 indicates that 3 nodes have joined the training.

[2023-09-27 02:23:21,097] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=192.168.0.71
  master_port=36725
  group_rank=0
  group_world_size=3
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[6, 6]
  global_world_sizes=[6, 6]

rank 1 is initialized local_rank = 1
loss = 2.3198373317718506, step = 0
loss = 1.7543025016784668, step = 20

Then we kill other jobs on the cluster to free resources so that worker-3 can start.

elasticjob-torch-mnist-dlrover-master 1/1 Running 0 5m39s
torch-mnist-edljob-worker-0 1/1 Running 0 5m34s
torch-mnist-edljob-worker-1 1/1 Running 0 5m34s
torch-mnist-edljob-worker-2 1/1 Running 0 5m34s
torch-mnist-edljob-worker-3 1/1 Running 0 5m34s

Then, from the log of worker-0, group_world_size=4 shows that 4 nodes are now training: the job has scaled up from 3 nodes to 4.

[2023-09-27 02:25:43,362] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=192.168.0.71
  master_port=58241
  group_rank=0
  group_world_size=4
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[8, 8]
  global_world_sizes=[8, 8]

rank 1 is initialized local_rank = 1
rank 0 is initialized local_rank = 0

loss = 2.2984073162078857, step = 0
loss = 2.1407980918884277, step = 20
loss = 1.1324385404586792, step = 40

Next, we inject a program fault into worker-1 so that it exits with an error. Because this MNIST example does not configure the Pod to restart on failure, the job does not start a new worker after worker-1 fails.

elasticjob-torch-mnist-dlrover-master 1/1 Running 0 7m43s
torch-mnist-edljob-worker-0 1/1 Running 0 7m38s
torch-mnist-edljob-worker-1 0/1 Error 0 7m38s
torch-mnist-edljob-worker-2 1/1 Running 0 7m38s
torch-mnist-edljob-worker-3 1/1 Running 0 7m38s

Then, from the log of worker-0, group_world_size=3 shows that training has scaled down to 3 nodes.

[2023-09-27 03:18:00,815] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=192.168.0.66
  master_port=39705
  group_rank=0
  group_world_size=3
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[6, 6]
  global_world_sizes=[6, 6]

[2023-09-27 03:18:05,957] [INFO] [sampler.py:153:load_state_dict] Load epoch = 0, completed num = 51200, num_samples = 1467
[2023-09-27 03:18:05,958] [INFO] [sampler.py:153:load_state_dict] Load epoch = 0, completed num = 51200, num_samples = 1467
loss = 0.2617453336715698, step = 0
loss = 0.2548859417438507, step = 20

Fault tolerance of TensorFlow PS distributed training

We will simulate the following scenarios to verify the elastic fault tolerance of DLRover in TensorFlow PS distributed training:

  • Worker Pod evicted
  • Worker Pod OOM
  • PS Pod evicted

Worker Pod was evicted

We use the TensorFlow training example to submit a job with 2 workers and 1 PS node. After submission, the Pods are as follows:

kubectl -n dlrover apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml

>>>
deepctr-manual-scale-edljob-chief-0 1/1 Running 0 88s
deepctr-manual-scale-edljob-ps-0 1/1 Running 0 88s
deepctr-manual-scale-edljob-worker-0 1/1 Running 0 88s
elasticjob-deepctr-manual-scale-dlrover-master 1/1 Running 0 99s

We can see that the job started chief-0, worker-0 and ps-0. In a TensorFlow PS job, the chief is also a worker, so this job runs two workers and one PS. We then manually kill worker-0 to simulate the Pod being evicted.

kubectl -n dlrover delete pod deepctr-manual-scale-edljob-worker-0

>>>
NAME READY STATUS RESTARTS AGE
deepctr-manual-scale-edljob-chief-0 1/1 Running 0 2m57s
deepctr-manual-scale-edljob-ps-0 1/1 Running 0 2m57s
deepctr-manual-scale-edljob-worker-1 1/1 Running 0 60s
elasticjob-deepctr-manual-scale-dlrover-master 1/1 Running 0 3m8s

You can see that after worker-0 was killed, the job started worker-1 to recover.

Worker Pod OOM

Next, we use chaosblade to create an OOM in chief-0.

kubectl -n dlrover exec -it deepctr-manual-scale-edljob-chief-0 bash
chaosblade-1.7.2/blade create mem load --mode ram --mem-percent 80
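If metrics-server is installed in the cluster, the growing memory usage of chief-0 can also be watched from outside the Pod before the kubelet OOM-kills the container:

# Observe the memory usage of chief-0 approaching its 4Gi limit (requires metrics-server).
kubectl -n dlrover top pod deepctr-manual-scale-edljob-chief-0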

Then we can see that chief-0 exited due to OOMKilled and chief-1 started.

deepctr-manual-scale-edljob-chief-0 0/1 OOMKilled 0 4m53s
deepctr-manual-scale-edljob-chief-1 1/1 Running 0 64s
deepctr-manual-scale-edljob-ps-0 1/1 Running 0 4m53s
deepctr-manual-scale-edljob-worker-1 1/1 Running 0 2m56s

Comparing the resource configurations of chief-0 and chief-1, we can see that chief-1's memory has been increased from 4Gi to 8Gi: for OOMKilled Pods, DLRover automatically increases the memory and restarts the Pod to prevent the OOM from happening again.

kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-0 -o yaml | grep memory
>>>
        memory: 4Gi
        memory: 4Gi

kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-1 -o yaml | grep memory
>>>
        memory: 8Gi
        memory: 8Gi
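The same comparison can be made with a jsonpath query, assuming the training container is the first container in the Pod:

# Print only the memory limit of the first container.
kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-1 \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}'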

PS Pod evicted

In the job submitted above, we manually delete ps-0. Then ps-1 starts and takes over the role of ps-0.

kubectl -n dlrover delete pod deepctr-manual-scale-edljob-ps-0

deepctr-manual-scale-edljob-chief-0 0/1 OOMKilled 0 10m
deepctr-manual-scale-edljob-chief-1 1/1 Running 0 7m1s
deepctr-manual-scale-edljob-ps-1 1/1 Running 0 109s
deepctr-manual-scale-edljob-worker-1 0/1 OOMKilled 0 8m53s
deepctr-manual-scale-edljob-worker-2 1/1 Running 0 4m13s
elasticjob-deepctr-manual-scale-dlrover-master 1/1 Running 0 11m

From the log of chief-1, we can see that the chief loaded the model from the checkpoint and continued training.

[2023-09-26 19:24:00,861] [INFO][saver.py:1531:restore] Restoring parameters from /nas/deepctr/model.ckpt-126
[2023-09-26 19:24:03,473] [INFO][session_manager.py:511:_try_run_local_init_op] Running local_init_op.
[2023-09-26 19:24:03,580] [INFO] [resource.py:164:report_resource] Report Resource CPU: 0.98, Memory 7146, GPU []
[2023-09-26 19:24:03,670] [INFO][session_manager.py:513:_try_run_local_init_op] Done running local_init_op.
[2023-09-26 19:24:07,665] [INFO][basic_session_run_hooks.py:627:_save] Saving checkpoints for 126 into /nas/deepctr/model.ckpt.
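To double-check that the checkpoints used for recovery really exist on the shared storage, the checkpoint directory (the /nas/deepctr path comes from the log above) can be listed from inside chief-1:

# List the checkpoint files on the shared storage.
kubectl -n dlrover exec -it deepctr-manual-scale-edljob-chief-1 -- ls -lh /nas/deepctr/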

Summary

Through the above experiments, we used ChaosBlade to verify that DLRover can automatically recover from various kinds of training failures, which improves the stability of distributed training, significantly reduces manual operation and maintenance costs, and improves training efficiency. In the next article, we will introduce how DLRover automatically adjusts the batch size of the DataLoader to improve training throughput.

DLRover

DLRover (Distributed Deep Learning System) is an open source project maintained by the Ant Group AI Infra team. It is an intelligent distributed deep learning system built on cloud-native technology. DLRover lets developers focus on the design of the model architecture without having to deal with engineering details such as hardware acceleration and distributed operation. Currently, DLRover supports using k8s and Ray for the automated operation and maintenance of deep learning training jobs. For more AI Infra technology, please follow the DLRover project.

Join DLRover DingTalk technology exchange group: 31525020959

If you find DLRover helpful, please star it on GitHub:
https://github.com/intelligent-machine-learning/dlrover