Article directory
- Background introduction
- Problem Scenario
- Logical scheme
- Project design
- Advantages and Disadvantages of the Plan
- Protection plan and data and service recovery testing
- Operation and maintenance online steps
Background introduction
The NameNode manages the namespace of the entire Hadoop file system: it maintains the file system tree and the metadata for all files and directories in it. If the NameNode metadata is accidentally lost, the entire HDFS dataset becomes irrecoverable, and losing the whole data asset this way is undoubtedly catastrophic.
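For context, this metadata lives on the NameNode's local disk as a checkpoint image (fsimage) plus edit logs. A listing like the following is typical (the /data/hadoop/dfs/name path matches the backup steps later in this article; the file names are illustrative):

# Typical contents of a NameNode metadata directory (names illustrative)
ls /data/hadoop/dfs/name/current
# VERSION
# seen_txid
# fsimage_0000000000000024666
# fsimage_0000000000000024666.md5
# edits_0000000000000024001-0000000000000024666
# edits_inprogress_0000000000000024667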
Problem Scenario
During the operation and maintenance of a Hadoop system, or in Hadoop-related applications, someone may accidentally format the NameNode, which causes all HDFS data to be lost irrecoverably.
Logical scheme
Project design
Supported version: Hadoop 2.x
Hadoop 3.x: to be verified
- Plan 1: Modify the Hadoop commands so that the NameNode format command is disabled.
- On all Hadoop nodes and all Hadoop clients, add the following guard to the
${HADOOP_HOME}/bin/hadoop, ${HADOOP_HOME}/bin/hdfs
files:
if [[ $* =~ "-format" ]]; then
  echo "It is forbidden to use -format related commands. If there are any exceptions, please contact the operations team."
  exit
fi
- Note: only the Hadoop installations that host the namenode and journalnode services can actually format the namenode, so it is essential that at least those installations carry the modification.
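To confirm the guard landed on every host, a quick spot check such as the following can help (a minimal sketch, assuming the guard shown above):

# The pattern should match the guard added to both wrapper scripts
grep -n -- '-format' "${HADOOP_HOME}/bin/hadoop" "${HADOOP_HOME}/bin/hdfs"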
- Plan 2: Back up the namenode and journal data regularly, and restore the data and services if they are accidentally formatted.
- Sample scheduled backup script
vim /opt/hadoop/sbin/hadoop-backup.sh
#!/bin/bash
#########################
# Back up namenode and journal data regularly
# Execute on the namenode host
# Runs once every hour
#########################
dt=`date +"%Y%m%d-%H%M"`
pre_dt=`date +"%Y%m%d" -d "-2day"`
pre_hour=`date +"%Y%m%d-%H" -d "-2hour"`
del_hour=`date +"%H" -d "-2hour"`
echo "current time" ${dt}
echo "two days ago" ${pre_dt}
echo "two hours ago" ${pre_hour}
echo "hour component from two hours ago" ${del_hour}
if [ ! -d "/data/hadoop/hadoopbak" ]; then
  mkdir -p /data/hadoop/hadoopbak
fi
# If the namenode data is large, it can be copied directly without compression,
# and the backup interval can be changed to once a day.
# Copy namenode and journal data
cd /data/hadoop/dfs
cp -r name /data/hadoop/hadoopbak/${dt}_name
cp -r journal /data/hadoop/hadoopbak/${dt}_journal
# Delete the namenode and journal backups taken two hours ago
#### backups taken at hour 00 are kept ####
# note: the -d tests assume at most one backup matches each hour pattern
if [ "${del_hour}" != "00" ]; then
  if [ -d /data/hadoop/hadoopbak/${pre_hour}*_name ]; then
    rm -rf /data/hadoop/hadoopbak/${pre_hour}*_name
  fi
  if [ -d /data/hadoop/hadoopbak/${pre_hour}*_journal ]; then
    rm -rf /data/hadoop/hadoopbak/${pre_hour}*_journal
  fi
fi
# Delete the namenode and journal backups from two days ago
if [ -d /data/hadoop/hadoopbak/${pre_dt}*_name ]; then
  rm -rf /data/hadoop/hadoopbak/${pre_dt}*_name
fi
if [ -d /data/hadoop/hadoopbak/${pre_dt}*_journal ]; then
  rm -rf /data/hadoop/hadoopbak/${pre_dt}*_journal
fi
- crontab schedule (hourly)
0 */1 * * * sh /opt/hadoop/sbin/hadoop-backup.sh
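Once the job has run for a few hours, a quick listing can confirm that backups are accumulating and being pruned as intended (illustrative; paths follow the script above):

# Newest backup directories produced by the script
ls -dt /data/hadoop/hadoopbak/*_name | head -3
ls -dt /data/hadoop/hadoopbak/*_journal | head -3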
Advantages and Disadvantages of the Plan
Advantages
- Plan 1 and Plan 2 together protect against scenarios such as running the wrong command or executing a command by accident.
- Data and services can be restored quickly.
- Plan 1 + Plan 2: the operations are relatively simple, work across commonly used versions, and the overall solution is easy to implement.
Disadvantages
- Since data is written and changed in real time, there is a window between the last backup and the formatting, and changes made within that window cannot be recovered. With the hourly schedule above, up to one hour of writes may be lost.
- Plan 1 cannot prevent scenarios such as malicious changes or deliberate destruction of data.
Protection plan and data and service recovery testing
- Set up a test Hadoop cluster
- Version: 2.7.3
- Machine planning: three hosts, 192.168.72.127, 192.168.72.128, and 192.168.72.129
- Installation process omitted
- Implement Plan 1: modify the hadoop and hdfs commands
vim /opt/hadoop/bin/hadoop
vim /opt/hadoop/bin/hdfs
- Test executing the format command (it should now be rejected by the guard)
- Continuously write data to the Hadoop cluster
vi test.sh

#!/bin/bash
i=1
while [ $i -le 7200 ]
do
  echo ${i} > $i.txt
  hadoop fs -put $i.txt /
  let i++
  sleep 1s
done

sh test.sh
- Implement Plan 2: back up the files
sh /opt/hadoop/sbin/hadoop-backup.sh
- Restore the original hdfs command and force a format:
hdfs namenode -format
By this point the test had written up to 89.txt. Because the namenode was formatted, the namenode service went down, writing stopped, and no HDFS data could be viewed.
- Use the Plan 2 backup to restore the data and bring the services back.
- Stop all journalnodes
/opt/hadoop/sbin/hadoop-daemon.sh stop journalnode
# If a journalnode cannot be stopped, kill the process directly.
- Delete the current namenode and journalnode data, then copy the backed-up name and journal data to the corresponding directories on the three hosts 127, 128, and 129.
cd /data/hadoop/dfs/
rm -rf journal/ name/
mv /data/hadoop/hadoopbak/20210402-0923_name.zip ./
mv /data/hadoop/hadoopbak/20210402-0923_journal.zip ./
unzip 20210402-0923_name.zip
unzip 20210402-0923_journal.zip
# Perform the same steps on all three servers: 127, 128, and 129
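Before restarting anything, it is worth a quick sanity check that the restored directories look right (illustrative, matching the layout used above):

# The name directory should contain fsimage/edits files under current/
ls /data/hadoop/dfs/name/current | head
ls /data/hadoop/dfs/journal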
- Start the journalnode on 127, 128, and 129 respectively
cd /opt/hadoop/sbin
./hadoop-daemon.sh start journalnode
- Start namenode
cd /opt/hadoop/sbin
./hadoop-daemon.sh start namenode
If you see the following error, it means the backup captured the namenode and journalnode data at slightly different points, leaving a gap in the edit log. Repair the backed-up namenode data first.
2021-04-02 10:27:04,789 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2021-04-02 10:27:04,790 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2021-04-02 10:27:04,790 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.IOException: There appears to be a gap in the edit log. We expected txid 24666, but got txid 24690.
    at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:698)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:294)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:976)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:812)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:796)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1493)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1559)
2021-04-02 10:27:04,792 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2021-04-02 10:27:04,794 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at 192-168-72-127/192.168.72.127
************************************************************/
Repair the backed-up namenode data
hadoop namenode -recover # Keep answering 'y' or 'c' at the prompts
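If the prompts are numerous, Hadoop's recovery mode also accepts a -force flag that automatically picks the first (default) choice; confirm it is available via hdfs namenode -help on your version before relying on it:

# Non-interactive variant of the recovery (verify the flag on your version)
hdfs namenode -recover -force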
Restart the namenodes on 127 and 128 respectively.
cd /opt/hadoop/sbin
./hadoop-daemon.sh start namenode
- Verify recovered data and services
View historical data
Note: the restored historical data only reaches 41.txt, while 89.txt had already been uploaded by the time of the format. The writes made between the backup and the format are lost.
Write data again
sh test.sh

# If you encounter the following error, it means the datanodes have not finished
# reporting block information to the namenode. Wait a few minutes or restart the datanodes.
21/04/02 10:39:04 INFO hdfs.DFSClient: Excluding datanode DatanodeInfoWithStorage[192.168.72.127:50010,DS-1aa30491-c989-4bb4-8d08-16beced62f02,DISK]
21/04/02 10:39:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2282)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1343)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
21/04/02 10:39:04 INFO hdfs.DFSClient: Abandoning BP-270677476-192.168.72.127-1617247664550:blk_1073745608_4784
21/04/02 10:39:04 INFO hdfs.DFSClient: Excluding datanode DatanodeInfoWithStorage[192.168.72.128:50010,DS-d7a08bf0-e12e-4d2b-953f-5b67d469c571,DISK]
21/04/02 10:39:04 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /101.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.

### Restart the datanode
cd /opt/hadoop/sbin
./hadoop-daemon.sh stop datanode
./hadoop-daemon.sh start datanode
The data is successfully written, indicating that the HDFS data writing function has been fully restored.
Note: at this point, the NameNode has gone from being formatted all the way back to full data and service recovery.
Operation and maintenance online steps
- Update the commands on all hadoop nodes and all hadoop clients by adding the following guard to the
${HADOOP_HOME}/bin/hadoop, ${HADOOP_HOME}/bin/hdfs
files
## Note: add this below the print_usage() function.
if [[ $* =~ "-skipTrash" ]] || [[ $* =~ "-format" ]]; then
  echo "It is forbidden to use -skipTrash and -format related commands. If there are exceptions, please contact the operations team."
  exit
fi
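With the guard in place, a blocked invocation would look roughly like this (illustrative session; the message is the one echoed by the guard above):

$ hdfs namenode -format
It is forbidden to use -skipTrash and -format related commands. If there are exceptions, please contact the operations team.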
- On the namenode machine, edit the backup script
Scheduled backup script; run it as the hadoop user.
vim /opt/hadoop/sbin/hadoop-backup.sh
#!/bin/bash
#########################
# Back up namenode and journal data regularly
# Execute on the namenode host
# Runs once every hour
#########################
dt=`date +"%Y%m%d-%H%M"`
pre_dt=`date +"%Y%m%d" -d "-2day"`
pre_hour=`date +"%Y%m%d-%H" -d "-2hour"`
del_hour=`date +"%H" -d "-2hour"`
echo "current time" ${dt}
echo "two days ago" ${pre_dt}
echo "two hours ago" ${pre_hour}
echo "hour component from two hours ago" ${del_hour}
if [ ! -d "/data/hadoop/hadoopbak" ]; then
  mkdir -p /data/hadoop/hadoopbak
fi
# Compress and back up namenode and journal data (zip variant, disabled)
#cd /data/hadoop/dfs
#zip -r /data/hadoop/hadoopbak/${dt}_name.zip name
#zip -r /data/hadoop/hadoopbak/${dt}_journal.zip journal
# Delete historical zip backups (disabled)
#if [ -f /data/hadoop/hadoopbak/${pre_dt}*_name.zip ]; then
#  rm -rf /data/hadoop/hadoopbak/${pre_dt}*_name.zip
#fi
#if [ -f /data/hadoop/hadoopbak/${pre_dt}*_journal.zip ]; then
#  rm -rf /data/hadoop/hadoopbak/${pre_dt}*_journal.zip
#fi
# If the namenode data is large, copy it directly without compression;
# the backup interval can then be changed to once a day.
# Copy namenode and journal data
cd /data/hadoop/dfs
cp -r name /data/hadoop/hadoopbak/${dt}_name
cp -r journal /data/hadoop/hadoopbak/${dt}_journal
# Delete the namenode and journal backups taken two hours ago
#### backups taken at hour 00 are kept ####
# note: the -d tests assume at most one backup matches each hour pattern
if [ "${del_hour}" != "00" ]; then
  if [ -d /data/hadoop/hadoopbak/${pre_hour}*_name ]; then
    rm -rf /data/hadoop/hadoopbak/${pre_hour}*_name
  fi
  if [ -d /data/hadoop/hadoopbak/${pre_hour}*_journal ]; then
    rm -rf /data/hadoop/hadoopbak/${pre_hour}*_journal
  fi
fi
# Delete the namenode and journal backups from two days ago
if [ -d /data/hadoop/hadoopbak/${pre_dt}*_name ]; then
  rm -rf /data/hadoop/hadoopbak/${pre_dt}*_name
fi
if [ -d /data/hadoop/hadoopbak/${pre_dt}*_journal ]; then
  rm -rf /data/hadoop/hadoopbak/${pre_dt}*_journal
fi
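Before enabling the hourly schedule, it is worth a rough check that one backup round fits the disk holding /data/hadoop/hadoopbak (illustrative commands, using the paths from the script above):

# Size of one backup round versus free space on the backup disk
du -sh /data/hadoop/dfs/name /data/hadoop/dfs/journal
df -h /data/hadoop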
Add the crontab schedule under the hadoop user
0 */1 * * * sh /opt/hadoop/sbin/hadoop-backup.sh
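After adding the entry, it can be confirmed from the hadoop user (a minimal check, assuming the script path above):

# The backup job should appear in the hadoop user's crontab
crontab -l | grep hadoop-backup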