A detailed reproduction of one cause of unassigned shards in an Elasticsearch cluster


Table of Contents

Background

Problem Reproduction

Troubleshooting and Locating

Thoughts on the Problem

Solving the Problem


Recently, some nodes in the company's ES cluster failed, leaving some index shards stuck in the unassigned state and turning the ES cluster status RED. It did not recover no matter how long we waited, which really ruins the look of the cluster UI. So let's think through the cause and fix it.

First, let's reproduce the unassigned-shard phenomenon in an ES cluster.

Background

There is an Elasticsearch cluster consisting of 9 nodes. The cluster details are as follows:
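The node-list screenshot from my cluster is not reproduced in text here. If you want to check your own cluster's nodes, the standard cat nodes API gives a quick overview (nothing in this request is specific to my setup):

GET _cat/nodes?v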

Problem Reproduction

1. Create a new test index with 3 primary shards and 2 replicas per shard
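The original creation request is only shown as a screenshot, but a minimal request along these lines creates such an index; the index name test matches the rest of the walkthrough, and the settings reflect step 1 (3 primaries, 2 replicas each). My actual request may have included additional settings:

PUT /test
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}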

2. After the test index is created, the shards are distributed across the cluster as follows
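If you prefer the API over the UI, cat shards lists where each copy landed (the column selection here is just one convenient view):

GET _cat/shards/test?v&h=index,shard,prirep,state,node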

3. Now kill the nodes hosting shard 0 of the test index (both the primary and its replica copies)

As shown in the figure above, the primary and replica copies of shard 0 are located on nodes es3, es5, and es6, so the ES processes on those three nodes are killed.

4. Check the ES cluster again; the status is now RED
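Without the UI, the cluster health API confirms this; the status field in the response is what to look at:

GET _cluster/health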

After waiting a while, I found that shard 2 of the test index was automatically redistributed evenly across the cluster. At this point, if some primary shards were on the killed es3, es5, or es6 nodes, don't worry: as long as a replica copy survives on another node in the cluster, that replica will automatically be promoted to the shard's new primary.

However, you will find that no matter how long you wait, shard 0 of the test index is never allocated and stays in the Unassigned state! That is exactly the problem I'm talking about.
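You can also see the stuck copies and why they are stuck directly from cat shards; unassigned.reason is a standard cat shards column:

GET _cat/shards/test?v&h=index,shard,prirep,state,unassigned.reason,node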

Troubleshooting and Locating

There are many possible reasons for a shard to be in the Unassigned state in an ES cluster; refer to the official documentation for the full list. The ES version I am using here is 7.17: cat shards API | Elasticsearch Guide [7.17] | Elastic

Following the official approach, let's troubleshoot why the shard demonstrated above was not allocated. The steps are as follows.

1) Execute the following command to check why the shard was not allocated

GET _cluster/allocation/explain

The result is as follows:

{
  "note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index" : ".ds-ilm-history-5-2023.11.01-000001",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2023-11-01T12:40:49.352Z",
    "details" : "node_left [GQ5oVVTiQeSGbWsv7OAptw]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions" : [
    {
      "node_id" : "-InsxrJ0RNOVMgEl0Nv2Xg",
      "node_name" : "es9",
      "transport_address" : "192.168.179.8:9309",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "2IUrp8zYQDa9pG6j0z59wQ",
      "node_name" : "es4",
      "transport_address" : "192.168.179.8:9304",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "Ll0UgYKSTIGMii5OdB4Kvg",
      "node_name" : "es2",
      "transport_address" : "192.168.179.8:9302",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "YDisZ0KVTyuu1CfojY5Iyw",
      "node_name" : "es7",
      "transport_address" : "192.168.179.8:9307",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "smY0M3lETju-eWmw2b5lqA",
      "node_name" : "es8",
      "transport_address" : "192.168.179.8:9308",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}


Pay attention to the key paragraph:

"unassigned_info" : {
  "reason" : "NODE_LEFT",
  "at" : "2023-11-01T12:40:49.352Z",
  "details" : "node_left [GQ5oVVTiQeSGbWsv7OAptw]",
  "last_allocation_status" : "no_valid_shard_copy"
},

The reason here is NODE_LEFT, which according to the official documentation means the node hosting the shard has left the cluster. In other words, the nodes holding both the primary and the replica copies of shard 0 are down.

The official description is as follows: cat shards API | Elasticsearch Guide [7.17] | Elastic

  • NODE_LEFT: Unassigned as a result of the node hosting it leaving the cluster.

Now it is roughly clear why the shard in the ES cluster stays Unassigned: the nodes holding both the primary and the replica copies of that shard are down, so the shard cannot be allocated anywhere. The cluster status turns RED and stays that way no matter how long you wait.
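Note that the response above explains a randomly chosen unassigned shard (here it happened to pick an ILM history index). To explain a specific shard, you can name the target in the request body, for example shard 0 of the test index from the reproduction above:

GET _cluster/allocation/explain
{
  "index": "test",
  "shard": 0,
  "primary": true
}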

Thoughts on the Problem

Let me summarize first, before I forget:

1. In an ES cluster, if the number of nodes is greater than the number of copies of each shard and the index has at least one replica, then when the node holding a shard's primary goes down, a node holding a replica of that shard will automatically promote its copy to primary, and that shard will not turn the ES cluster RED.

2. If the nodes holding both the primary and the replica copies of a shard are all down (the situation reproduced above), you can manually force the shard to be allocated to a healthy node. Doing so allocates the shard as an empty shard, which is not recommended because the data in that shard will very likely be lost. If the data is not important, though, this is an acceptable option.

Solving the Problem

Let me emphasize again: if your ES cluster has a reasonable number of nodes and each shard has more than one copy, the probability that every node holding a copy of a given shard goes down at the same time is small. So if you run into the situation I reproduced, it is recommended to focus on investigating why the ES nodes failed and fix the problem at the root. It is not advisable to immediately force-allocate the shard to another node; it is better to wait for the nodes holding the primary or replica copies to rejoin the cluster normally, otherwise data will be lost.

So, if the index data is not important and you do want to force the shard onto another healthy ES node, what should you do?

Just issue the following command:

POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "test",
        "shard": 0,
        "node": "es2",
        "accept_data_loss": true
      }
    }
  ]
}

Here index is the index name, shard is the shard number to operate on, node is the node that the empty primary will be allocated to, and accept_data_loss must be explicitly set to true to acknowledge that the shard's data may be lost.

For a detailed explanation of the above command, refer to the official documentation: Cluster reroute API | Elasticsearch Guide [7.17] | Elastic

Especially the explanation of allocate_empty_primary:

allocate_empty_primary

Allocate an empty primary shard to a node. Accepts the index and shard for index name and shard number, and node to allocate the shard to. Using this command leads to a complete loss of all data that was indexed into this shard, if it was previously started. If a node which has a copy of the data rejoins the cluster later on, that data will be deleted. To ensure that these implications are well-understood, this command requires the flag accept_data_loss to be explicitly set to true.

After the above command is executed, shard 0 of the test index is allocated as an empty shard on node es2 and the test index becomes available again. The ES cluster screenshot is as follows:

Of course, the cluster in the screenshot is still not green because other indices also have unassigned shards that I have not force-allocated yet. You can run the same command to empty those shards too, and the cluster will turn green. At the very least, the command above has turned the test index back into a normally usable index.
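To verify the result without relying on screenshots, a check along these lines should show shard 0 of test in the STARTED state on es2, with replicas being rebuilt on the surviving nodes, and the index-level health recovering:

GET _cat/shards/test?v&h=index,shard,prirep,state,node
GET _cluster/health/test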

… …

Damn it, since you've read this far, give me a like; subscribing to my paid column to support me wouldn't hurt either. We are very professional when it comes to big data... hahaha...