HDFS cluster rolling upgrade and downgrade rollback

Table of Contents

1. HDFS cluster rolling upgrade

1.1 Introduction

1.2 Rolling upgrade without downtime

1.2.1 Non-federated HA cluster

1.2.1.1 Rolling upgrade preparation

1.2.1.2 Upgrade Active NN and Standby NN

1.2.1.3 Upgrade DN

1.2.1.4 Complete rolling upgrade

1.2.2 Federated HA Cluster

1.3 Downtime upgrade

1.3.1 Non-HA cluster

2. HDFS cluster downgrade and rollback

2.1 The difference between downgrade and rollback

2.2 HA cluster downgrade (downgrade)

2.2.1 Downgrade DataNode

2.2.2 Downgrade Active NameNode and Standby NameNode

2.2.3 Confirmation of downgrade operation

2.2.4 HA cluster downgrade considerations

2.3 Cluster rollback operation


1. HDFS Cluster rolling upgrade

1.1 Introduction

Hadoop v2 introduced NameNode High Availability (HA) for HDFS, which makes it feasible to upgrade HDFS without downtime. Note that rolling upgrades are supported only from Hadoop 2.4.0 onward. To upgrade an HDFS cluster without downtime, the cluster must be set up with HA.

In an HA cluster there are two or more NameNodes (NN), many DataNodes (DN), a few JournalNodes (JN) and a few ZooKeeper nodes (ZKN). JNs are relatively stable and in most cases do not need to be upgraded when HDFS is upgraded.

A rolling upgrade therefore covers only the NNs and DNs, not the JNs and ZKNs. Upgrading JNs and ZKNs may cause cluster downtime.

1.2 Non-stop rolling upgrade

1.2.1 Non-Federated HA Cluster

Suppose there are two NameNodes, NN1 and NN2, where NN1 is in the Active state and NN2 is in the Standby state.

1.2.1.1 Rolling upgrade preparations
# Create a new fsimage file for rollback
hdfs dfsadmin -rollingUpgrade prepare

# Re-run the following command to check whether the rollback fsimage has been created.
# Once the message "Proceed with rolling upgrade" is displayed, the preparation is complete.
hdfs dfsadmin -rollingUpgrade query
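
The polling step above can be scripted. What follows is a minimal sketch, not an official tool: the function name and the HDFS_CMD, POLL_INTERVAL and MAX_TRIES variables are assumptions made for illustration. It re-runs the query command until the output contains "Proceed with rolling upgrade" and gives up after a fixed number of checks.

```shell
# Sketch: poll until the rollback fsimage is ready.
# HDFS_CMD, POLL_INTERVAL and MAX_TRIES are hypothetical knobs, not Hadoop settings.
wait_for_rollback_image() {
  hdfs_cmd="${HDFS_CMD:-hdfs}"
  interval="${POLL_INTERVAL:-10}"   # seconds between checks
  max_tries="${MAX_TRIES:-60}"      # give up after this many checks
  try=1
  while [ "$try" -le "$max_tries" ]; do
    # Capture the query output; treat a failed call as "not ready yet".
    status=$("$hdfs_cmd" dfsadmin -rollingUpgrade query 2>/dev/null) || status=""
    case "$status" in
      *"Proceed with rolling upgrade"*)
        echo "rollback image ready after $try check(s)"
        return 0
        ;;
    esac
    try=$((try + 1))
    sleep "$interval"
  done
  echo "rollback image not ready after $max_tries checks" >&2
  return 1
}
```

Calling `wait_for_rollback_image` right after the prepare command blocks until it is safe to start upgrading the NameNodes, or returns a non-zero status if the image is still not ready.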

1.2.1.2 Upgrade Active NN and Standby NN
# Stop NN2 (the Standby):
hdfs --daemon stop namenode
# Upgrade the software on NN2, then start it:
hdfs --daemon start namenode -rollingUpgrade started

# Perform a failover so that NN2 becomes the Active node and NN1 becomes the Standby node.
# Stop NN1:
hdfs --daemon stop namenode
# Upgrade the software on NN1, then start it:
hdfs --daemon start namenode -rollingUpgrade started
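
The steps above do not show a command for the failover itself. One way to trigger it is `hdfs haadmin -failover`, followed by `hdfs haadmin -getServiceState` to confirm the result. The sketch below assumes the NameNode IDs are those configured in dfs.ha.namenodes.<nameservice>; the function name and the HDFS_CMD variable are hypothetical.

```shell
# Sketch: fail over from one NameNode to the other and confirm the result.
# Arguments are NameNode IDs (placeholders such as nn1 and nn2).
# HDFS_CMD is a hypothetical override for the hdfs binary.
failover_and_verify() {
  from_nn="$1"
  to_nn="$2"
  hdfs_cmd="${HDFS_CMD:-hdfs}"
  # Ask for a failover from the current Active to the upgraded Standby.
  "$hdfs_cmd" haadmin -failover "$from_nn" "$to_nn" || return 1
  # getServiceState prints "active" or "standby" for the given NameNode ID.
  state=$("$hdfs_cmd" haadmin -getServiceState "$to_nn") || return 1
  if [ "$state" = "active" ]; then
    echo "$to_nn is active"
  else
    echo "$to_nn is $state, expected active" >&2
    return 1
  fi
}
```

For the example in this section the call would be `failover_and_verify nn1 nn2`, assuming NN1 and NN2 are registered under those IDs.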

1.2.1.3 Upgrade DN
# Select a small subset of DataNodes to upgrade (for example, all DataNodes in one rack).
# Shut down a selected DN for upgrade. IPC_PORT is set by dfs.datanode.ipc.address and defaults to 9867.
hdfs dfsadmin -shutdownDatanode <DATANODE_HOST:IPC_PORT> upgrade

# Check whether the DataNode has stopped serving. If its information can still be retrieved, it has not actually shut down yet.
hdfs dfsadmin -getDatanodeInfo <DATANODE_HOST:IPC_PORT>

# Upgrade the software on the node, then start the DN.
hdfs --daemon start datanode

# Repeat the steps above for every DN in the selected subset, then for the next subset, until all DNs in the cluster are upgraded.
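
The shutdown-and-check part of this procedure can be wrapped in a small helper. This is a sketch under one stated assumption: that `hdfs dfsadmin -getDatanodeInfo` exits with a non-zero status once the DataNode no longer answers. The function name and the HDFS_CMD, POLL_INTERVAL and MAX_TRIES variables are hypothetical.

```shell
# Sketch: shut one DataNode down for upgrade and wait until it stops answering.
# Assumption: getDatanodeInfo exits non-zero once the DataNode is unreachable.
# HDFS_CMD, POLL_INTERVAL and MAX_TRIES are hypothetical knobs.
shutdown_datanode_for_upgrade() {
  dn_addr="$1"                      # DATANODE_HOST:IPC_PORT
  hdfs_cmd="${HDFS_CMD:-hdfs}"
  interval="${POLL_INTERVAL:-5}"    # seconds between checks
  max_tries="${MAX_TRIES:-24}"      # give up after this many checks
  "$hdfs_cmd" dfsadmin -shutdownDatanode "$dn_addr" upgrade || return 1
  try=1
  while [ "$try" -le "$max_tries" ]; do
    # A failing getDatanodeInfo call is taken to mean the node is down.
    if ! "$hdfs_cmd" dfsadmin -getDatanodeInfo "$dn_addr" >/dev/null 2>&1; then
      echo "$dn_addr is down"
      return 0
    fi
    try=$((try + 1))
    sleep "$interval"
  done
  echo "$dn_addr still responding after $max_tries checks" >&2
  return 1
}
```

Once the helper reports that the node is down, the software on that host can be upgraded and the DataNode started again with `hdfs --daemon start datanode`.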

1.2.1.4 Complete rolling upgrade
# Complete rolling upgrade
hdfs dfsadmin -rollingUpgrade finalize

1.2.2 Federated HA Clusters

A federated cluster is a cluster with multiple namespaces. Each namespace is served by its own pair of Active and Standby NameNodes. Such a cluster is commonly referred to as a Federation + HA cluster.

The upgrade process for a federated cluster is essentially the same as for a non-federated cluster. The only difference is that the upgrade operations must be repeated for each namespace.

#1. Run the upgrade preparation for each namespace
hdfs dfsadmin -rollingUpgrade prepare

#2. Upgrade the Active and Standby NNs of each namespace
#2.1. Stop NN2:
hdfs --daemon stop namenode
#2.2. Upgrade and start NN2:
hdfs --daemon start namenode -rollingUpgrade started
#2.3. Perform a failover so that NN2 becomes the Active node and NN1 becomes the Standby node.
#2.4. Stop NN1:
hdfs --daemon stop namenode
#2.5. Upgrade and start NN1:
hdfs --daemon start namenode -rollingUpgrade started

#3. Upgrade each DataNode
#3.1. Shut down the selected DN for upgrade. IPC_PORT is set by dfs.datanode.ipc.address; the default is 9867.
hdfs dfsadmin -shutdownDatanode <DATANODE_HOST:IPC_PORT> upgrade
#3.2. Check whether the DataNode has stopped serving. If its information can still be retrieved, it has not actually shut down yet.
hdfs dfsadmin -getDatanodeInfo <DATANODE_HOST:IPC_PORT>
#3.3. Upgrade and start the DN.
hdfs --daemon start datanode

#4. After the upgrade is complete, run the finalize command for each namespace.
hdfs dfsadmin -rollingUpgrade finalize
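
Because prepare, query and finalize have to be issued once per namespace, a small loop helps avoid skipping one. The sketch below assumes each namespace can be addressed with the generic `-fs hdfs://<nameservice>` option of `hdfs dfsadmin`. The nameservice names, the function name and the HDFS_CMD variable are placeholders.

```shell
# Sketch: run one rollingUpgrade action against every nameservice of a federation.
# The nameservice names passed in are placeholders for the names in dfs.nameservices.
# HDFS_CMD is a hypothetical override for the hdfs binary.
rolling_upgrade_all() {
  action="$1"                       # prepare, query or finalize
  shift
  hdfs_cmd="${HDFS_CMD:-hdfs}"
  for ns in "$@"; do
    echo "== $ns: rollingUpgrade $action"
    # Stop at the first nameservice that reports an error.
    "$hdfs_cmd" dfsadmin -fs "hdfs://$ns" -rollingUpgrade "$action" || return 1
  done
}
```

For a federation with two nameservices named ns1 and ns2 (hypothetical names), `rolling_upgrade_all prepare ns1 ns2` covers step 1 and `rolling_upgrade_all finalize ns1 ns2` covers step 4.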

1.3 Downtime Upgrade

1.3.1 Non-HA Cluster

In a non-HA cluster the upgrade inevitably involves a short service outage, because the NameNode has to be restarted and no standby node is available in the meantime. The overall process follows the same 4 steps as for a non-federated HA cluster, except that step 2 is slightly different:

#Step1: Rolling upgrade preparation (same as 1.2.1.1)

#Step2: Upgrade NN and SNN
#1. Stop SNN
hdfs --daemon stop secondarynamenode
#2. Stop NN
hdfs --daemon stop namenode
#3. Upgrade and start NN
hdfs --daemon start namenode -rollingUpgrade started
#4. Upgrade and start SNN (started normally; the -rollingUpgrade option applies only to the NN)
hdfs --daemon start secondarynamenode

#Step3: Upgrade DN (same as 1.2.1.3)

#Step4: Complete rolling upgrade
hdfs dfsadmin -rollingUpgrade finalize

2. HDFS Cluster downgrade and rollback

2.1 The difference between downgrade and rollback

  • Common points:

Both return the software to the version in use before the upgrade;

Once the upgrade has been finalized, neither downgrade nor rollback is allowed.

  • Differences:

A downgrade can be done in a rolling fashion, whereas a rollback requires stopping the service for a period of time;

A downgrade only restores the software to the pre-upgrade version and keeps the user's current data state;

A rollback also restores user data to its pre-upgrade state, and the current data state is not kept.

Friendly reminder: Be cautious when upgrading, and even more cautious when downgrading and rolling back.

In a production environment, evaluate thoroughly before upgrading the cluster whether the new version is compatible with the existing services. Rehearse the complete upgrade process in a test environment, and back up the cluster state before the upgrade to guard against unexpected outages. Do not count on rollback or downgrade to rescue the cluster when an upgrade fails.

2.2 HA Cluster Downgrade (downgrade)

If you do not want to keep the upgraded version, or in the unlikely case that the upgrade fails (for example because of a bug in the newer version), the administrator can either downgrade HDFS to the pre-upgrade version, or roll HDFS back to both the pre-upgrade version and the pre-upgrade state.

Note that a downgrade can be done in a rolling fashion, but a rollback cannot. Rollback requires cluster downtime.

Note also that downgrade and rollback are possible only after a rolling upgrade has been started and before it has been terminated. An upgrade is terminated by finalizing it, by downgrading, or by rolling back. Therefore it may not be possible to roll back after a finalize or a downgrade, or to downgrade after a finalize.

2.2.1 Downgrade DataNode

#1. Select a subset of DataNodes (for example, all DataNodes in one rack)
#2. Shut down one of the selected DataNodes. IPC_PORT is set by dfs.datanode.ipc.address; the default is 9867.
hdfs dfsadmin -shutdownDatanode <DATANODE_HOST:IPC_PORT> upgrade

#3. Check whether the node has completely stopped
hdfs dfsadmin -getDatanodeInfo <DATANODE_HOST:IPC_PORT>

#4. Restore the pre-upgrade software on the node and start the DataNode normally
hdfs --daemon start datanode

#5. Repeat the steps above for the other DataNodes in the selected subset, then for the remaining subsets

2.2.2 Downgrade Active NameNode and Standby NameNode

# Stop and downgrade the Standby NameNode.
# Start the Standby NameNode normally.
# Trigger a failover so that the Active and Standby roles are swapped.
# Stop and downgrade the NameNode that was previously Active (now the Standby).
# Start it normally as a Standby node.
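
The steps above are given only as comments. As a rough sketch, the helper below prints the corresponding command sequence for an HA pair so it can be reviewed before being run by hand. The function name and the bracketed host labels are hypothetical, the NameNode IDs are placeholders, and `hdfs haadmin -failover` is only one way to trigger the failover.

```shell
# Sketch: print the NameNode downgrade sequence for an HA pair.
# Arguments: ID of the current Active NN, then ID of the current Standby NN.
# The bracketed prefix names the node on which each command is meant to run.
print_nn_downgrade_plan() {
  active="$1"
  standby="$2"
  cat <<EOF
[$standby] hdfs --daemon stop namenode
[$standby] (restore the pre-upgrade software)
[$standby] hdfs --daemon start namenode
[any] hdfs haadmin -failover $active $standby
[$active] hdfs --daemon stop namenode
[$active] (restore the pre-upgrade software)
[$active] hdfs --daemon start namenode
EOF
}
```

With NN1 active and NN2 standby, `print_nn_downgrade_plan nn1 nn2` lists the NN2 steps first, then the failover, then the NN1 steps, so that one Active node is available throughout.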

2.2.3 Confirmation of downgrade operation

# Complete the downgrade operation
hdfs dfsadmin -rollingUpgrade finalize

2.2.4 HA cluster downgrade considerations

Downgrade and upgrade have one thing in common in HA mode: when operating the NameNode, the operation starts from the Standby node. After the Standby node is upgraded/downgraded, a switch is performed so that the other node can be upgraded/downgraded. Throughout the process, an Active node is always maintained to provide external services.

The order in which NameNodes and DataNodes are handled during a downgrade is exactly the opposite of the order used during an upgrade. A new version is generally compatible with the old version in terms of protocols and APIs, but not the other way round. If the NNs were downgraded first, the cluster would have new-version DNs and old-version NNs, and many protocols used by the new DNs would be incompatible with the old NNs. You must therefore downgrade the DNs first and then the NNs. What looks like a simple reversal of order has this deeper reason behind it.

The downgrade operations of federated clusters and non-HA clusters correspond to the upgrade operations. Just replace the corresponding operation commands.

2.3 Cluster rollback (rollback) operation

Notes on rollback: Rollback does not support rolling operations. During the operation, it requires the cluster to stop providing external services.

The rollback operation will not only return the software version to the version before the upgrade, but also return the user data to the state before the upgrade.

Rollback steps:

#1. Stop all NameNode and DataNode nodes
#2. Restore the pre-upgrade software version on all node machines
#3. Start NN1 with the -rollingUpgrade rollback option, as the Active node
#4. Run the -bootstrapStandby command on NN2, then start NN2 normally as a Standby node
#5. Start all DataNodes with the -rollback parameter
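
Steps 3 to 5 map onto concrete start commands. The lookup below is a small sketch that returns the command for each role after the old software has been restored. The role names (active-nn, standby-nn, datanode) and the function name are hypothetical labels chosen for this example.

```shell
# Sketch: map a node role to the command used to bring it up during a rollback.
# Role names are hypothetical labels, not Hadoop terms.
rollback_command_for() {
  case "$1" in
    active-nn)
      # Step 3: NN1 starts with the rollback option and becomes Active.
      echo "hdfs --daemon start namenode -rollingUpgrade rollback"
      ;;
    standby-nn)
      # Step 4: NN2 re-syncs from NN1, then starts normally as Standby.
      echo "hdfs namenode -bootstrapStandby && hdfs --daemon start namenode"
      ;;
    datanode)
      # Step 5: every DataNode starts with the -rollback option.
      echo "hdfs --daemon start datanode -rollback"
      ;;
    *)
      echo "unknown role: $1" >&2
      return 1
      ;;
  esac
}
```

Printing the commands instead of running them keeps the sketch safe to try: for example, `rollback_command_for datanode` shows what would be run on each DataNode host.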
