java–kafka essay–broker&topic&partition&copy-replica understanding

First, let’s take a look at basic Message-related terms:

Name Explanation
Broker Message middleware processing node, a Kafka node is a broker, and one or more Brokers can form a Kafka cluster
Topic Kafka categorizes messages according to topics. Each message published to the Kafka cluster needs to specify a topic
Producer Message producer, the client that sends messages to Broker
Consumer Message consumer, client that reads messages from Broker
ConsumerGroup Each Consumer belongs to a specific Consumer Group. A message can be consumed by multiple different Consumer Groups, but only one Consumer in a Consumer Group can consume the message
Partition Physical concept, a topic can be divided into multiple partitions, and the internal messages of each partition are ordered

1. Agent Broker

Previously, we have introduced to you that producers deliver messages to the message queue, and consumers pull data from the message queue.

A very important concept in the Kafka message queue is the agent Broker. Can you imagine what a commodity agent does in life? Purchase, inventory, sales.

Kafka’s Broker also plays the same role: receiving messages, saving messages, and providing messages to consumers.

Specific to the Kafka architecture level, we can think of a Broker agent as a Kafka service instance.

Kafka can start multiple service instances to form a service cluster with multiple Broker agents.

Generally, the more Brokers in a cluster, the stronger the overall throughput capacity of the Kafka cluster.

This is easy to understand. In real life, the more agents a product has, the stronger its sales ability is. It is a truth.

Because Kafka is usually deployed in a distributed manner, a physical server (an operating system) usually only deploys and starts one Kafka instance, so in this scenario the Broker agent can be understood as a server.

Can one server deploy multiple kafka instances? Yes, it is possible to avoid port conflicts by modifying the port, but this is not good. Because the distributed deployment of Kafka takes into account high availability,

That is: one server is down, but the kafka cluster is still available. If a server deploys multiple Kafka instances, once the server goes down, the impact will be huge, and it is likely to directly bring down the Kafka cluster.

================================================== ===================================

2. Topics and topic partitions

Agents Broker can help high-end manufacturers to “purchase, sell and stock” goods, but there is a problem. The goods are not classified.

Obviously, the sales cycle and sales frequency of different products such as Moutai, dairy products, and pork are different. In order to effectively arrange the purchase, sale and inventory of goods, it is necessary to classify the goods.

In the same way, some of the messages received by kafka need to be processed quickly, and some are high-frequency data with low timeliness requirements. Therefore, the message data needs to be classified, and each classification is called a Topic.

The author thinks that the word Topic would be better translated as “channel” here, but more people have recognized the translation method of “topic”, so I take it first.

The middle solid line in the figure below is a Broker with a single kafka instance, including three topics. That is to say, this agent represents three commodities: alcohol, dairy products, and snacks. Topic is a logical concept used to classify messages.

In the above figure, the dotted range represents the topic, and the pipe-shaped graphic represents the partition of the topic.

Agents classify products according to topics. Then there is another problem. If an agent’s work is single-threaded, it will be stretched when handling high-concurrency work.

Therefore, we introduce the concept of partition to Topic. A partition is a real physical queue data structure used to store data, occupying system memory, disk data storage and other resources.

  • A partition is a partition of a topic, so a topic contains one or more partitions. The number of partitions a Topic contains depends on the throughput capacity requirements for commodity processing under the topic.
  • Because Topic is a logical concept, partitions can also be called “partition agents”. A proxy Broker contains multiple partitions. Just like a provincial agent can contain multiple prefecture-level agents.

Questions:

Kafka works basically the same as other MQs, except for some noun naming differences. For better discussion, here is a brief explanation of these terms. These explanations should give you a rough idea of how kafka MQ works.

  • Producer (P): It is the client that Kafka sends messages to
  • Consumer (C): client that gets messages from kafka
  • Topic (T): can be understood as a queue
  • Consumer Group (CG): This is the method used by Kafka to implement broadcast (sent to all consumers) and unicast (sent to any consumer) of a topic message. A topic can have multiple CGs. The topic’s message will be copied (not really copied, but conceptually) to all CGs, but each CG will only send the message to one consumer in the CG. If you need to implement broadcasting, as long as each consumer has an independent CG. To achieve unicast, all consumers need to be in the same CG. Using CG, consumers can also be grouped freely without the need to send messages to different topics multiple times.
  • Broker (B): A kafka server is a broker. A cluster consists of multiple brokers. A broker can accommodate multiple topics.
  • Partition(P): In order to achieve scalability, a very large topic can be distributed to multiple brokers (i.e. servers). Kafka only guarantees that messages will be sent to consumers in the order within a partition, and does not guarantee the order of a topic as a whole (among multiple partitions).

================================================== =========================================

Partition copy and high availability

After solving the throughput problem, the following is how to ensure the high availability of the Kafka cluster.

As introduced above: a Kafka cluster contains multiple Broker instances, and usually each Broker instance is deployed on a different server to run independently.

This is one of the common methods to ensure high availability in a distributed architecture: multiple instances of services. If one service instance fails, other instances can provide services to ensure high availability.

Then the second method to ensure high availability of distributed clusters: multiple copies of data. If the data of one copy is lost, other data copies can be used.

Where does kafka data exist? Partitioning, so for Kafka, the high-availability way to ensure that data is not lost is to partition multiple copies.

The combination of multiple instances of Broker service and multiple copies of partitions is the picture above. Explain this picture:

  • 1 wine theme contains 4 partitions, each partition has three copies (one master and two slaves, the same color)
  • Three partition copies are distributed on four Broker service instances
  • The producer only sends message data to the primary partition replica
  • Consumers also only pull message data from the primary partition replica

In addition, the master-slave relationship between the three partition copies of a topic partition in the above figure is not fixed, but is elected based on the status of the Broker service instance where the partition copy is located.

For example, the current status of partition copies A, B, and C is that A is the primary partition copy. If the Broker where partition copy A is located dies, then partition copy A loses its “primary copy qualification” and a new primary partition copy is elected between partition copies B and C.

Therefore, in the cluster shown in the above figure as an example, partition copies are scattered across multiple Broker service instances, so even if one or two service instances hang up, the entire message service will not be unavailable.

Because there is one partition copy left, producers and consumers only communicate with the primary partition copy (Leader), and the slave partition copy (Follower) only serves as a data backup.

1) Producer: The message producer is the client that sends messages to the kafka broker;

2) Consumer: Message consumer, the client that gets messages from kafka broker;

3) Consumer Group (CG): Consumer group consists of multiple consumers. Each consumer in the consumer group is responsible for consuming data from different partitions. A partition can only be consumed by consumers in one group; consumer groups do not affect each other. All consumers belong to a certain consumer group, that is, the consumer group is a logical subscriber.

4) Broker: A kafka server is a broker. A cluster consists of multiple brokers. A broker can accommodate multiple topics.

5) Topic: It can be understood as a queue, and both producers and consumers are oriented to the same topic;

6) Partition: In order to achieve scalability, a very large topic can be distributed to multiple brokers (i.e. servers). A topic can be divided into multiple partitions**, and each partition is an ordered queue;

7) Replica: Replica. In order to ensure that when a node in the cluster fails, the partition data on the node is not lost and kafka can still continue to work. Kafka provides a replica mechanism. Each partition of a topic has several Copy, one leader and several followers;

8) Leader: The “master” of multiple copies of each partition, the object to which the producer sends data, and the object to which the consumer consumes data are all leaders.

9) Follower: The “slave” in multiple copies of each partition synchronizes data from the leader in real time and maintains synchronization with the leader data. When the leader fails, a follower will become the new leader.

================================================== ===========================

There are multiple brokers (kafka servers) in a cluster

There can be multiple topics in a broker;

A topic has multiple partitions, and multiple partitions are distributed on different servers;

Multiple copies (backup);

Partition copies are distributed across multiple Broker service instances, so even if one or two service instances hang up, the entire message service will not be unavailable.

Because there is one partition copy left, producers and consumers only communicate with the primary partition copy (Leader), and the slave partition copy (Follower) only serves as a data backup.

================================================== ===============================

================================================== ===================================

Quote: https://blog.csdn.net/Aeroever/article/details/130352535

Topic is a logical concept, while partition is a physical concept. Each partition corresponds to a log file, and the log file stores the data produced by the producer.

1.What is Topic

Kafka, like ActiveMQ, is an excellent middleware for message subscription/sending. In ActiveMQ, we know that it has the concepts of Queue and Topic,

But in Kafka, there is only the concept of Topic (the Kafka consumer can implement the function of Queue in ActiveMQ through the group.id attribute, see Figure 1)

In Kafka, Topic is a logical concept for storing messages and can be understood as a collection of messages.

Each message sent to the Kafka cluster will have a category that indicates the Topic to which the message is sent.

In terms of storage, messages from different Topics are stored separately. Each Topic can have multiple producers sending messages to it, or multiple consumers can consume messages in the same Topic (see Figure 2)

Replenish:

Here Queue involves the concept of a Consumer Group. (The explanation at groupid=1 in the above picture is a little problematic. For the groupid of this consumer group, please refer to this article 3.Consumer Group consumer group)

2.What is Partition?

Partition means partition in Kafka. Partitioning improves the concurrency of Kafka and also solves the load balancing of data in Topic.

That is: each Topic in Kafka can be divided into multiple partitions (each Topic has at least one partition), and different partitions under the same Topic contain different messages (partitions can be indirectly understood as table partitioning operations of the database).

When each message is added to a partition, it is assigned an offset, which is the unique number of the message in the current partition.

Kafka can guarantee the order of messages in partitions through offset, but cross-partitions are unordered, that is, Kafka only guarantees that messages in the same partition are ordered.

As shown in the figure below, we create a Topic named test through the command (the command is as follows ↓↓↓), partition it, and set up 3 partitions, namely test-0, test-1, and test-2.

When each message is sent to the broker, it will be calculated according to the partition rules of the Partition, and then choose which Partition to store the message in.

If the Partition rules are set appropriately, then all messages will be evenly distributed in different Partitions, which is similar to the concept of database and table sharding in a database, and the data will be sharded.

Question 1: At this point, you may be wondering why the first producer writes the message to test-0, and so on. This involves 5. producer message distribution strategy. Please continue reading.

The command to create a Topic is as follows:

bin/kafka-topics.sh –create –zookeeper 192.168.204.201:2181,192.168.204.202:2181,192.168.204.203:2181 –replication-factor 1 –partitions 3 –topic test

Note: bin/kafka-topics.sh –create —->kafka comes with the command –create means creating a topic

–zookeeper xxx.xxx.xxx.xxx:2181 —->zookeeper cluster address

–replication-factor 1 —->Number of backups (1 backup)

–partitions 3 —->Number of kafka partitions (indicates 3 partitions)

–topic test —->The name of the topic to be created

3.Consumer Group Consumer Group

A consumer group consists of multiple consumers. Each consumer in the consumer group is responsible for consuming data from different partitions. A partition can only be consumed by a certain consumer in a group; consumer groups do not affect each other. All consumers belong to a certain consumer group, that is, a consumer group is a logical subscriber.

For example, a topic has two partitions, partition0 and partition1. There is a consumer group, and there are two consumers in the group, customer0 and customer1.

Customer0 and customer1 in the consumer group can only consume data from a certain partition in the topic. For example, customer0 consumes partition0 and customer1 consumes partition1.

If there is no concept of consumer groups and the topic has two partitions, partition0 and partition1, and only one consumer, customer0, then the data in both partitions, partition0 and partition1, needs to be consumed by customer0. The benefits of consumer groups can increase spending power! ! !

4.Storage of Topic and Partition
This example is introduced using a Kafka cluster built with three servers: 192.168.204.201, 192.168.204.202, and 192.168.204.203.

As shown in the figure below, the topic named test has been created. So how is Partition stored? ?

Partition is stored in the file system in the form of a file. As above, a topic named test is created. We define it to have 3 partitions. Since the partition is stored in the form of a file, where are these 3 partitions stored?

We can find it in kafka’s data directory (/tmp/kafka-log). This directory can be configured by ourselves. In the /tmp/kafka-log directory, we will see 3 directories: test-0, test-1, test-2. The naming convention is topic_name-partition_id. The directory is as shown below:

Question 2: At this point, you may be wondering why 3 partitions are randomly assigned to 3 servers. This will involve the allocation strategy of multiple partitions in the cluster. So how to reasonably distribute multiple partitions in the cluster?

Answer: (1) Sort all N Brokers and i Partitions (in this example N = 3, i = 3)

(2) Assign the i-th Partition to (i % n) Brokers. (In this way, test-1 will be assigned to the first station, and so on)

5.producer message distribution strategy

Message is the most basic data unit in Kafka. In Kafka, a message consists of key and value, and both key and value can be empty.

What is the use of the key here? When we send a message, we can specify this key, and the producer will determine which partition the current message should be sent and stored based on the key and partition mechanism. (Problem 1 is now solved)

What to do if key in Kafka is null? By default, Kafka uses the hash modulo partitioning algorithm. If key is null, a partition will be randomly assigned. This randomization is to randomly select one within the time range of this parameter “metadata.max.age.ms”. For this time period, if key is null, it will only be sent to the only partition. This value is updated every 10 minutes by default.

In addition, Kafka also provides us with an entry point for customizing message distribution strategies. We can customize message distribution strategies according to our own business conditions. So how to implement our own partitioning strategy? We only need to define a class, implement the Partitioner interface, and override its partition method. Then when configuring kafka, just set and use our customized message distribution strategy. How to customize the message distribution strategy, please refer to 4.1 Customizing the Message Distribution Strategy Demo

5.1 Custom message distribution strategy Demo

**
 * 1. Customized partition strategy
 */
public class MyPartition implements Partitioner {
    Random random = new Random();
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        //Get the partition list
        List<PartitionInfo> partitionInfos = cluster.partitionsForTopic(topic);
        int partitionNum = 0;
        if(key == null){
            partitionNum = random.nextInt(partitionInfos.size());//Random partition
        } else {
            partitionNum = Math.abs((key.hashCode())/partitionInfos.size());
        }
        System.out.println("Current Key:" + key + "----->Current Value:" + value + "----->" + "Current storage partition:\ " + partitionNum);
        return partitionNum;
    }
 
    public void close() {
 
    }
 
    public void configure(Map<String, ?> map) {
 
    }
}
/**
 * Under SpringBoot, add the following partitioner.class attribute and specify to use the custom MyPartition class.
 */
spring:
  kafka:
    properties:
      partitioner.class: com.report.kafka.partition.MyPartition
 
/**
 * Spring uses xml or annotation form, just configure the following attributes
 */
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG,"com.report.kafka.partition.MyPartition");

6. How consumers consume specified partition messages

At this time, the topic named test has three partitions, namely 0, 1, and 2. If we want to consume the messages in partition 0, how should we consume them? There are two ways to operate kafka using Java: spring-kafka.jar and kafka-clients.jar. The two methods are introduced below to complete the consumption of messages from the specified partition.

/**
 * 1. Use the KafkaTemplate type in the spring-kafka.jar package
 * Use @KafkaListener annotation method
 * As follows: It means that the messages in partition 1 are consumed under the topic named test.
 */
@KafkaListener(topicPartitions = {@TopicPartition(topic = "test",partitions = {"1"})})
 
/**
 * 2. Use the KafkaConsumer type in the kafka-clients.jar package
 * As follows: It means that the messages in partition 1 are consumed under the topic named test.
 */
TopicPartition topicPartition = new TopicPartition("test" , 1);
KafkaConsumer consumer = new KafkaConsumer(props);
consumer.assign(Arrays.asList(topicPartition));

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Java Skill TreeHomepageOverview 139142 people are learning the system