Text | Wang Qinlong (alias: Chang Fan)
Ant Group AI Systems Engineer
ChaosBlade is an open-source chaos experiment injection tool developed by Alibaba. It follows chaos engineering principles and the chaos experiment model, and can be used to verify the stability of cloud-native systems. As a cloud-native distributed training system, DLRover provides elasticity and fault-tolerance features to improve the stability of distributed training. In this article, we use ChaosBlade to create various chaos experiments to verify DLRover's elastic fault tolerance.
Prerequisite: Create a k8s cluster and deploy DLRover ElasticJob
- Create a k8s cluster that can be accessed locally through kubectl. In the following experiments, we use Alibaba Cloud's ACK cluster.
- To deploy DLRover ElasticJob on a k8s cluster, please refer to the documentation.
- Build a training image with chaosblade installed; please refer to the Dockerfile. We also provide a ready-made image for training MNIST, registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:torch201-mnist, which we use in the following experiments.
Python distributed training elastic fault tolerance
We will run experiments simulating the following scenarios to verify the elastic fault tolerance of DLRover distributed training:
- A training Pod is preempted or evicted.
- A training Pod is a slow node (straggler).
- A training Pod is scheduled on a faulty machine.
- The network of a training node fails during training.
- The training process crashes inside a training Pod.
- Automatic scale-up and scale-down of training.
Training Pod was preempted
In this experiment, we use the MNIST example to verify that DLRover can recover a preempted Pod. We replace the command in the job YAML with the following:
command:
  - /bin/bash
  - -c
  - "dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
     --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
     examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
     --training_data /data/mnist_png/training/ \
     --validation_data /data/mnist_png/testing/"
Submit a 4-node job:
kubectl -n dlrover apply -f examples/pytorch/mnist/chaos_test_job.yaml
After submitting the job, we can see the following Pods through kubectl -n dlrover get pods:

chaos-test-edljob-worker-0             1/1   Running   0   85s
chaos-test-edljob-worker-1             1/1   Running   0   85s
chaos-test-edljob-worker-2             1/1   Running   0   85s
chaos-test-edljob-worker-3             1/1   Running   0   85s
elasticjob-chaos-test-dlrover-master   1/1   Running   0   89s
We manually delete a Pod to simulate the Pod being preempted.
kubectl -n dlrover delete pod chaos-test-edljob-worker-0
We see that worker-0 is deleted and a new worker-4 is then started to replace it.
chaos-test-edljob-worker-1             1/1   Running   0   2m3s
chaos-test-edljob-worker-2             1/1   Running   0   2m3s
chaos-test-edljob-worker-3             1/1   Running   0   2m3s
chaos-test-edljob-worker-4             1/1   Running   0   30s
elasticjob-chaos-test-dlrover-master   1/1   Running   0   2m7s
Through the logs of worker-1, we can see that training continues:
kubectl -n dlrover logs chaos-test-edljob-worker-1
>>>
loss = 2.298487901687622, step = 0
loss = 2.195965051651001, step = 20
loss = 1.2307546138763428, step = 40
loss = 0.6579511761665344, step = 60
loss = 1.0608341693878174, step = 80
loss = 0.7761049270629883, step = 100
The training node is a slow node
To simulate one of the training nodes being a slow node, we use chaosblade to raise the CPU load of a Pod to 90%:
blade create cpu load --cpu-percent 90
We replace the command in the MNIST example YAML file with the following, where start_chaos.sh cpu-overload raises the CPU load of worker-1 to 90%, making it a slow node.
command:
  - /bin/bash
  - -c
  - "(bash examples/pytorch/mnist/start_chaos.sh cpu-overload &) && \
     dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
     --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
     examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
     --training_data /data/mnist_png/training/ \
     --validation_data /data/mnist_png/testing/"
Then use kubectl -n dlrover apply -f examples/pytorch/mnist/chaos_test_job.yaml to submit a job. The Pods are as follows:
elasticjob-torch-mnist-debug-dlrover-master   0/1   Completed   0   3h17m
torch-mnist-debug-edljob-worker-0             0/1   Completed   0   3h17m
torch-mnist-debug-edljob-worker-1             0/1   Error       0   3h17m
torch-mnist-debug-edljob-worker-2             0/1   Completed   0   3h17m
torch-mnist-debug-edljob-worker-3             0/1   Completed   0   3h17m
torch-mnist-debug-edljob-worker-4             0/1   Completed   0   3h10m
As you can see, worker-1 reported an error and exited. From the log of worker-1, we can see that it exited because it was a slow node:
[2023-09-26 03:52:20,235] [INFO] [training.py:707:run] Fault nodes are: [] and stragglers are: [1].
Traceback (most recent call last):
  File "/usr/local/bin/dlrover-run", line 8, in <module>
    sys.exit(main())
  ...
  File "/usr/local/lib/python3.8/site-packages/dlrover/python/elastic_agent/torch/training.py", line 733, in run
    raise RuntimeError("The node is a straggler and exits.")
RuntimeError: The node is a straggler and exits.
This is because the job enables node checking and automatic exit for slow nodes via dlrover-run --network-check --exclude-straggler. If you do not want a slow node to exit with an error, you can remove --exclude-straggler. During the node check, dlrover-run has each node perform a simple matrix multiplication and an allgather task and measures the elapsed time. We can also check the time each node took to run the detection task from the log of the job master Pod elasticjob-torch-mnist-debug-dlrover-master:
kubectl -n dlrover logs elasticjob-torch-mnist-debug-dlrover-master | grep elapsed
>>>
Round 0: The node elapsed time is {2: 20.307, 3: 20.265, 0: 206.872, 1: 151.752}
Round 1: The node elapsed time are {2: 20.307, 3: 20.265, 0: 23.174, 1: 135.961}
Round 2: The node elapsed time are {2: 21.491, 0: 22.685, 3: 20.889, 1: 23.097}
Each network detection in DLRover consists of two rounds. In the first two rounds above, worker-1 took much longer than the other nodes. After worker-1 exited with an error, DLRover started worker-4 to replace it. Worker-4 is a normal node: its detection time is close to that of the other nodes, so the slow-node impact is gone.
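As a simplified sketch of this judgment (our own illustration, not DLRover's actual code; the 3x threshold is an assumption), a node can be flagged as a straggler when its check time exceeds a multiple of the fastest node's time:

```python
def find_stragglers(elapsed, ratio=3.0):
    """Return ranks whose check time exceeds `ratio` x the fastest node's.

    `elapsed` maps node rank -> seconds spent on the matmul/allgather check.
    The threshold rule is hypothetical, for illustration only.
    """
    fastest = min(elapsed.values())
    return sorted(rank for rank, t in elapsed.items() if t > ratio * fastest)

# Elapsed times from Round 0 above: worker-0 and worker-1 are both slow.
print(find_stragglers({2: 20.307, 3: 20.265, 0: 206.872, 1: 151.752}))  # [0, 1]
# Round 1: worker-0 has recovered; only worker-1 remains a straggler.
print(find_stragglers({2: 20.307, 3: 20.265, 0: 23.174, 1: 135.961}))  # [1]
```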
The training Pod is scheduled on the faulty machine
If a training Pod is scheduled on a faulty machine, for example one with a broken GPU card, the training process cannot start. To simulate a faulty machine, we use chaosblade to kill a PyTorch child process so that it exits with an error. We replace the command in the MNIST example YAML file with the following:
command:
  - /bin/bash
  - -c
  - "(bash examples/pytorch/mnist/start_chaos.sh kill-process &) && \
     dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
     --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
     examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
     --training_data /data/mnist_png/training/ \
     --validation_data /data/mnist_png/testing/"
start_chaos.sh kill-process kills the network-detection subprocess started by dlrover-run in worker-1. This simulates worker-1 being a faulty machine on which the GPU process cannot start normally. After submitting the job, we can see that worker-1 exits with an error and worker-4, a normal node, is started:
chaos-test-edljob-worker-0             1/1   Running   0   12m
chaos-test-edljob-worker-1             0/1   Error     0   12m
chaos-test-edljob-worker-2             1/1   Running   0   12m
chaos-test-edljob-worker-3             1/1   Running   0   12m
chaos-test-edljob-worker-4             1/1   Running   0   3m59s
elasticjob-chaos-test-dlrover-master   1/1   Running   0   12m
By looking at the log of worker-1, we can see that it exited because it was treated as a faulty machine:
Traceback (most recent call last):
  ...
  File "/usr/local/lib/python3.8/site-packages/dlrover/python/elastic_agent/torch/training.py", line 732, in run
    raise RuntimeError("The node network is breakdown.")
RuntimeError: The node network is breakdown.
We can also view each node's network detection results in the log of the job master Pod elasticjob-torch-mnist-debug-dlrover-master. Each check consists of two rounds. Worker-1 failed both of the first two rounds, so it reported an error and exited. The newly started worker-4 is not a faulty node, so the check passed and model training started.
Round 1: The node status are {1: False, 2: True, 3: True, 0: False}.
Round 2: The node status are {1: False, 2: True, 3: True, 0: True}.
Round 3: The node status are {3: True, 0: True, 1: True, 2: True}.
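The retry logic visible in these rounds can be sketched as follows (a simplified illustration under our own assumptions, not DLRover's implementation): a node is declared faulty only if it fails two consecutive rounds, so worker-0's transient failure in Round 1 does not kill it:

```python
def faulty_nodes(round_a, round_b):
    """Return ranks that failed the check in both consecutive rounds.

    `round_a`/`round_b` map node rank -> True (passed) / False (failed).
    A node that fails only one round (a transient error) gets another chance.
    """
    return sorted(r for r in round_a
                  if not round_a[r] and not round_b.get(r, True))

# Statuses from the experiment above: only worker-1 fails both rounds.
round1 = {1: False, 2: True, 3: True, 0: False}
round2 = {1: False, 2: True, 3: True, 0: True}
print(faulty_nodes(round1, round2))  # [1]
```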
Pod network failure during training
In this experiment, we first start a normal training job. We replace the command in the MNIST example YAML file with the following:
command:
  - /bin/bash
  - -c
  - "(bash examples/pytorch/mnist/start_chaos.sh no-chaos &) && \
     dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
     --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
     examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
     --training_data /data/mnist_png/training/ \
     --validation_data /data/mnist_png/testing/"
After loss values appear in the log of worker-1, we enter worker-1 and use chaosblade to set its network packet loss rate to 100%, causing a network failure:
kubectl -n dlrover exec -it chaos-test-edljob-worker-1 bash
./chaosblade-1.7.2/blade create network loss --percent 100 --interface eth0
Then we see that worker-1 exits with an error and worker-4 starts:
chaos-test-edljob-worker-0             1/1   Running   0   4m39s
chaos-test-edljob-worker-1             0/1   Error     0   4m39s
chaos-test-edljob-worker-2             1/1   Running   0   4m39s
chaos-test-edljob-worker-3             1/1   Running   0   4m39s
chaos-test-edljob-worker-4             1/1   Running   0   17s
elasticjob-chaos-test-dlrover-master   1/1   Running   0   4m43s
Then, through worker-0's logs, we can see that training recovers and continues:
loss = 0.24101698398590088, step = 0
loss = 0.4646361768245697, step = 20
Training process crashed
In this experiment, we first start a normal training job, then kill a training process with kill -9 to observe whether training resumes. We replace the first line of the command in the MNIST example YAML file with (bash examples/pytorch/mnist/start_chaos.sh no-chaos &) &&, and then start the job.
After loss values appear in the log of worker-1, we enter the Pod to find the training processes:

kubectl -n dlrover exec -it chaos-test-edljob-worker-1 bash
ps -aux | grep cnn_train.py

Then we terminate one of the training processes with kill -9 ${PID}. Checking the Pod status, we can see that the Pod did not exit with an error, which indicates that training continues: dlrover-run restarts the killed child process inside the Pod.
chaos-test-edljob-worker-0             1/1   Running   0   3m4s
chaos-test-edljob-worker-1             1/1   Running   0   3m4s
chaos-test-edljob-worker-2             1/1   Running   0   3m4s
chaos-test-edljob-worker-3             1/1   Running   0   3m4s
elasticjob-chaos-test-dlrover-master   1/1   Running   0   3m9s
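The behavior that keeps the Pod alive can be sketched as a small supervisor loop, in the spirit of the --max_restarts flag (our own simplified illustration, not DLRover's code):

```python
import subprocess
import sys

def supervise(cmd, max_restarts=3):
    """Run `cmd` and restart it after an abnormal exit, up to `max_restarts` times.

    Returns the number of restarts performed. A simplified illustration of
    the idea behind dlrover-run's --max_restarts flag.
    """
    restarts = 0
    while True:
        code = subprocess.call(cmd)
        if code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1  # the child crashed (e.g. kill -9): start it again

if __name__ == "__main__":
    # A child that always crashes: the supervisor retries 3 times, then gives up.
    crasher = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crasher, max_restarts=3))  # prints 3
```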
Automatic scale-up and scale-down of training
In this experiment, we use the MNIST example to submit a job when cluster resources are insufficient for all 4 nodes. As you can see, the job started only 3 Pods; the fourth is pending due to insufficient resources:
elasticjob-torch-mnist-dxlrover-master   1/1   Running   0   57s
torch-mnist-edljob-worker-0              1/1   Running   0   47s
torch-mnist-edljob-worker-1              1/1   Running   0   47s
torch-mnist-edljob-worker-2              1/1   Running   0   47s
torch-mnist-edljob-worker-3              0/1   Pending   0   47s
This is because we set --nnodes=3:$WORKER_NUM in this job, where WORKER_NUM is an environment variable automatically set by DLRover in the Pod; its value is the number of replicas, which is 4 in this experiment. About 2 minutes later, we can see from the log of worker-0 that the three running nodes have started training. The rendezvous log group_world_size=3 indicates that 3 nodes joined the training:
[2023-09-27 02:23:21,097] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=192.168.0.71
  master_port=36725
  group_rank=0
  group_world_size=3
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[6, 6]
  global_world_sizes=[6, 6]
rank 1 is initialized local_rank = 1
loss = 2.3198373317718506, step = 0
loss = 1.7543025016784668, step = 20
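The effect of --nnodes=3:$WORKER_NUM can be sketched as a simple rule (a simplified illustration; the real elastic rendezvous protocol is more involved): rendezvous completes once at least the minimum number of nodes is ready, and group_world_size is the number that actually joined:

```python
def rendezvous_world_size(ready_nodes, min_nodes, max_nodes):
    """Return group_world_size if rendezvous can complete, else None.

    With --nnodes=3:4, training starts once 3 nodes are ready; if a 4th
    node joins later, a new rendezvous forms a larger group. A sketch of
    the elastic rendezvous rule, not the actual protocol.
    """
    if ready_nodes < min_nodes:
        return None  # keep waiting, up to the rdzv pend_timeout
    return min(ready_nodes, max_nodes)

print(rendezvous_world_size(3, 3, 4))  # 3: start with worker-3 still pending
print(rendezvous_world_size(4, 3, 4))  # 4: after worker-3 is scheduled
print(rendezvous_world_size(2, 3, 4))  # None: not enough nodes yet
```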
Then we kill other jobs on the cluster so that worker-3 can start.
elasticjob-torch-mnist-dlrover-master   1/1   Running   0   5m39s
torch-mnist-edljob-worker-0             1/1   Running   0   5m34s
torch-mnist-edljob-worker-1             1/1   Running   0   5m34s
torch-mnist-edljob-worker-2             1/1   Running   0   5m34s
torch-mnist-edljob-worker-3             1/1   Running   0   5m34s
Then, from the log of worker-0, we can see that 4 nodes are now in the training group: the job has been scaled up from 3 nodes to 4 nodes.
[2023-09-27 02:25:43,362] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=192.168.0.71
  master_port=58241
  group_rank=0
  group_world_size=4
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[8, 8]
  global_world_sizes=[8, 8]
rank 1 is initialized local_rank = 1
rank 0 is initialized local_rank = 0
loss = 2.2984073162078857, step = 0
loss = 2.1407980918884277, step = 20
loss = 1.1324385404586792, step = 40
Next, we inject a program fault into worker-1 so that it reports an error and exits. Because the Pods in this MNIST example are not configured to restart on error, the job does not start a new worker after worker-1 fails:
elasticjob-torch-mnist-dlrover-master   1/1   Running   0   7m43s
torch-mnist-edljob-worker-0             1/1   Running   0   7m38s
torch-mnist-edljob-worker-1             0/1   Error     0   7m38s
torch-mnist-edljob-worker-2             1/1   Running   0   7m38s
torch-mnist-edljob-worker-3             1/1   Running   0   7m38s
Then, from the log of worker-0, we can see through group_world_size=3 that the training has been scaled down to 3 nodes:
[2023-09-27 03:18:00,815] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=192.168.0.66
  master_port=39705
  group_rank=0
  group_world_size=3
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[6, 6]
  global_world_sizes=[6, 6]
[2023-09-27 03:18:05,957] [INFO] [sampler.py:153:load_state_dict] Load epoch = 0, completed num = 51200, num_samples = 1467
[2023-09-27 03:18:05,958] [INFO] [sampler.py:153:load_state_dict] Load epoch = 0, completed num = 51200, num_samples = 1467
loss = 0.2617453336715698, step = 0
loss = 0.2548859417438507, step = 20
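The sampler log above (Load epoch = 0, completed num = 51200) hints at how training resumes without repeating data: the sampler checkpoints how many samples were consumed and skips them after a restart. A minimal sketch of this idea (the class name and structure are our assumptions, not DLRover's actual sampler):

```python
class ResumableSampler:
    """Yield dataset indices while tracking how many have been consumed.

    After a restart, load_state_dict() skips already-completed samples so
    training continues where it left off. Illustration only.
    """
    def __init__(self, num_samples):
        self.num_samples = num_samples
        self.completed = 0

    def __iter__(self):
        for i in range(self.completed, self.num_samples):
            self.completed = i + 1
            yield i

    def state_dict(self):
        return {"completed": self.completed}

    def load_state_dict(self, state):
        self.completed = state["completed"]

sampler = ResumableSampler(num_samples=10)
it = iter(sampler)
consumed = [next(it) for _ in range(3)]       # [0, 1, 2] before the "crash"
restored = ResumableSampler(num_samples=10)   # a new process after the restart
restored.load_state_dict(sampler.state_dict())
print(list(restored))  # [3, 4, 5, 6, 7, 8, 9]
```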
Fault tolerance of TensorFlow PS distributed training
We will simulate the following scenarios to verify the elastic fault tolerance of DLRover in TensorFlow PS distributed training:
- Worker Pod evicted
- Worker Pod OOM
- PS Pod evicted
Worker Pod was evicted
We use the TF training example to submit a job with 2 workers and one PS node. After submission, the Pods are as follows:
kubectl -n dlrover apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml
>>>
deepctr-manual-scale-edljob-chief-0              1/1   Running   0   88s
deepctr-manual-scale-edljob-ps-0                 1/1   Running   0   88s
deepctr-manual-scale-edljob-worker-0             1/1   Running   0   88s
elasticjob-deepctr-manual-scale-dlrover-master   1/1   Running   0   99s
You can see that the job started chief-0, worker-0, and ps-0. In a TensorFlow PS job, the chief is also a worker, so this job starts two workers and one PS. Then we manually kill worker-0 to simulate the Pod being evicted:
kubectl -n dlrover delete pod deepctr-manual-scale-edljob-worker-0
>>>
NAME                                             READY   STATUS    RESTARTS   AGE
deepctr-manual-scale-edljob-chief-0              1/1     Running   0          2m57s
deepctr-manual-scale-edljob-ps-0                 1/1     Running   0          2m57s
deepctr-manual-scale-edljob-worker-1             1/1     Running   0          60s
elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running   0          3m8s
You can see that after worker-0 was killed, the job started worker-1 to recover.
Worker Pod OOM
Next, we use chaosblade to create an OOM in chief-0:
kubectl -n dlrover exec -it deepctr-manual-scale-edljob-chief-0 bash
chaosblade-1.7.2/blade create mem load --mode ram --mem-percent 80
Then we can see that chief-0 exited due to OOMKilled and chief-1 started.
deepctr-manual-scale-edljob-chief-0    0/1   OOMKilled   0   4m53s
deepctr-manual-scale-edljob-chief-1    1/1   Running     0   64s
deepctr-manual-scale-edljob-ps-0       1/1   Running     0   4m53s
deepctr-manual-scale-edljob-worker-1   1/1   Running     0   2m56s
By comparing the resource configurations of chief-0 and chief-1, we can see that the memory of chief-1 has been increased from 4Gi to 8Gi: DLRover automatically increases the memory of an OOMKilled Pod when relaunching it, to prevent the OOM from recurring.
kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-0 -o yaml | grep memory
>>>
memory: 4Gi
memory: 4Gi

kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-1 -o yaml | grep memory
>>>
memory: 8Gi
memory: 8Gi
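This relaunch policy can be sketched as follows (the doubling factor and cap are our assumptions for illustration, not DLRover's actual configuration):

```python
def next_memory_request(current_gi, oom_killed, factor=2, cap_gi=64):
    """Return the memory request (GiB) for a relaunched Pod.

    If the previous Pod was OOMKilled, grow the request by `factor`,
    capped at `cap_gi`. Factor and cap are illustrative assumptions.
    """
    if not oom_killed:
        return current_gi
    return min(current_gi * factor, cap_gi)

print(next_memory_request(4, oom_killed=True))   # 8, matching chief-1 above
print(next_memory_request(8, oom_killed=False))  # 8, unchanged
```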
PS Pod evicted
On the job submitted above, we manually delete ps-0. Then ps-1 starts up and takes over the role of ps-0:
kubectl -n dlrover delete pod deepctr-manual-scale-edljob-ps-0
>>>
deepctr-manual-scale-edljob-chief-0              0/1   OOMKilled   0   10m
deepctr-manual-scale-edljob-chief-1              1/1   Running     0   7m1s
deepctr-manual-scale-edljob-ps-1                 1/1   Running     0   109s
deepctr-manual-scale-edljob-worker-1             0/1   OOMKilled   0   8m53s
deepctr-manual-scale-edljob-worker-2             1/1   Running     0   4m13s
elasticjob-deepctr-manual-scale-dlrover-master   1/1   Running     0   11m
From the log of chief-1, we can see that the chief loaded the model from the checkpoint and continued training:
[2023-09-26 19:24:00,861] [INFO] [saver.py:1531:restore] Restoring parameters from /nas/deepctr/model.ckpt-126
[2023-09-26 19:24:03,473] [INFO] [session_manager.py:511:_try_run_local_init_op] Running local_init_op.
[2023-09-26 19:24:03,580] [INFO] [resource.py:164:report_resource] Report Resource CPU: 0.98, Memory 7146, GPU []
[2023-09-26 19:24:03,670] [INFO] [session_manager.py:513:_try_run_local_init_op] Done running local_init_op.
[2023-09-26 19:24:07,665] [INFO] [basic_session_run_hooks.py:627:_save] Saving checkpoints for 126 into /nas/deepctr/model.ckpt.
Summary
Through the above experiments, we used ChaosBlade to verify that DLRover can automatically recover from various training failures, improving the stability of distributed training. This significantly reduces manual operation and maintenance costs and improves training efficiency. In the next article, we will introduce how DLRover automatically adjusts the batch size of the DataLoader to improve training throughput.
DLRover
DLRover (Distributed Deep Learning System) is an open-source project maintained by the Ant Group AI Infra team. It is an intelligent distributed deep learning system built on cloud-native technology. DLRover lets developers focus on model architecture design without having to handle engineering details such as hardware acceleration and distributed operation. Currently, DLRover supports automated operation and maintenance of deep learning training jobs on K8s and Ray. For more AI Infra technology, please follow the DLRover project.
Join DLRover DingTalk technology exchange group: 31525020959
If you find this helpful, please star DLRover:
https://github.com/intelligent-machine-learning/dlrover