15 | Representatives of stream computing: Storm, Flink, Spark Streaming

The big data technologies introduced earlier mainly process and compute large-scale data that already sits on storage media. This type of computing is also called big data batch computing: as the name implies, the data is computed in batches, such as one day’s access logs or all historical order data. This data is usually stored on disk via HDFS and computed with a batch computing framework such as MapReduce or Spark; a single computation typically takes several minutes to several hours to complete.

In addition, there is another class of big data technology that computes and processes large-scale data in real time, as it is generated. Familiar examples include video data collected in real time by cameras and order data generated in real time by Taobao. In a first-tier city like Shanghai, the number of cameras in public places runs into the millions; even if only the video from important locations needs real-time processing, hundreds of thousands of cameras may be involved. If you want to spot, in real time, wanted criminals or vehicles violating traffic regulations appearing in the video, you need to process the data generated by these cameras in real time. The biggest difference in real-time processing is that this data, unlike data stored in HDFS, is transmitted in real time, or, figuratively speaking, it flows in. Real-time processing systems for this kind of big data are therefore also called big data stream computing systems.

Currently, the well-known big data stream computing frameworks in the industry include Storm, Spark Streaming, and Flink. Next, let’s look at their architectural principles and how they are used, one by one.

Storm

In fact, the demand for real-time processing of big data has existed for a long time. Early on, we used message queues to process big data in real time: if the processing was complicated, many message queues, along with producers and consumers implementing different pieces of business logic, had to be strung together. The process looked roughly like the picture below.

In the figure, the message queues are responsible for moving the data, while each piece of processing logic is both a consumer and a producer: it consumes data from the upstream message queue and produces data for the downstream one. Such a system could only be developed for one specific need at a time, and every new requirement meant redeveloping a similar system from scratch. Because the producer and consumer logic differs between applications, and the processing flows differ too, none of this work could be reused.
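To make the pattern concrete, here is a minimal illustrative sketch (not from the original figure) that uses in-process Java BlockingQueues to stand in for message queues; each intermediate stage is simultaneously a consumer of the queue before it and a producer for the queue after it.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueuePipeline {
    public static void main(String[] args) {
        // Two "message queues" connecting three stages of a pipeline.
        BlockingQueue<String> rawQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> cleanQueue = new LinkedBlockingQueue<>();

        // Stage 1: pure producer, pushes raw data into the first queue.
        new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) rawQueue.put("event-" + i);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        // Stage 2: consumer of rawQueue AND producer for cleanQueue.
        new Thread(() -> {
            try {
                while (true) cleanQueue.put(rawQueue.take().toUpperCase());
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        // Stage 3: final consumer.
        new Thread(() -> {
            try {
                while (true) System.out.println(cleanQueue.take());
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();
    }
}

Every new business requirement would mean rewriting this kind of wiring by hand, which is exactly the reuse problem described above.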

A natural next thought: could we build a general stream processing computing system, so that we only need to define the processing flow and the processing logic of each node, deploy the code to the system, and have it execute according to that predefined flow and logic? Storm was born against this background, and it is one of the earliest big data stream computing frameworks. If the above example is implemented with Storm, the process becomes much simpler.

With Storm, developers no longer need to worry about how data flows or how messages are processed and consumed. They only need to develop bolts containing the data-processing logic and spouts acting as data sources, define the topology that connects them, and submit it all to Storm to run.
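As an illustration, here is a minimal sketch of such a program against Storm’s Java API (roughly the Storm 2.x signatures); the LogSpout and ParseBolt classes and the topology name are hypothetical placeholders, not the article’s own example.

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class LogTopology {

    // A spout is the data source: Storm calls nextTuple() to pull data in.
    public static class LogSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            // A real spout would read from a log file, Kafka, etc.
            collector.emit(new Values("one log line"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // A bolt holds the processing logic; it consumes tuples and may emit new ones.
    public static class ParseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String line = input.getStringByField("line");
            collector.emit(new Values(line.trim()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("parsed"));
        }
    }

    public static void main(String[] args) throws Exception {
        // The topology wires spouts and bolts into a processing graph.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("logs", new LogSpout(), 2);
        builder.setBolt("parse", new ParseBolt(), 4).shuffleGrouping("logs");
        StormSubmitter.submitTopology("log-topology", new Config(), builder.createTopology());
    }
}

Note how the developer writes only the spout, the bolt, and the wiring; the data flow between them is entirely Storm’s job.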

After understanding the operating mechanism of Storm, let’s take a look at its architecture. Storm, like Hadoop, also has a master-slave architecture.

Nimbus is the Master of the cluster, responsible for cluster management, task allocation, and so on. A Supervisor is a Slave and is where the computation is actually done. Each Supervisor starts multiple Worker processes, each Worker runs multiple Tasks, and a Task is a spout or a bolt. Supervisors and Nimbus coordinate task allocation, heartbeat detection, and other operations through ZooKeeper.

The design philosophies of Hadoop and Storm are in fact the same: extract everything that has nothing to do with the specific business logic, such as sharding of big data, data movement, and task deployment and execution, into a framework. Developers then only need to write business logic code that follows the framework’s conventions and submit it to the framework for execution.

And this is exactly the development philosophy behind all frameworks: separate business logic from technical plumbing so that developers only need to focus on business development. Frameworks familiar to Java developers, such as Tomcat and Spring, were all developed from this idea.

Spark Streaming

We know that Spark is a batch processing big data computing engine, mainly performing computation over large batches of historical data. As I introduced earlier when discussing Spark’s architecture, Spark is an engine built for fast computation: it shards the source data and loads the shards into the cluster for processing. When the data volume is not very large and the processing is not very complicated, a computation can finish in seconds or even milliseconds.

Spark Streaming cleverly exploits Spark’s sharding and fast computation: it slices the data arriving in real time by time, merges everything received within each interval into a single batch, and hands that batch to Spark for processing. The picture below describes this slicing and batching process in Spark Streaming.

If the time slices are small enough, the amount of data in each slice is small, and the Spark engine is fast enough, the overall effect looks as if the data were processed in real time. This is the secret behind Spark Streaming’s real-time stream computing.

Note that when initializing a Spark Streaming instance, you need to specify the slicing interval. In the code example below, the interval is 1 second.

val ssc = new StreamingContext(conf, Seconds(1))

Of course, you can also specify a smaller interval, such as 500 ms, so results arrive faster. The choice of interval usually depends on the business scenario: for example, if you want to count highway traffic flow once per minute, the interval can be set to 1 minute.

Spark Streaming is mainly responsible for converting streaming data into small batches; Spark handles the rest.
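For reference, a minimal sketch of a complete Spark Streaming program in Java (the socket source on localhost:9999 and the word count are illustrative choices, not from the article):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");
        // Batch interval of 1 second: incoming data is grouped into 1-second mini batches.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Each mini batch of lines is handed to the Spark engine as ordinary batch data.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
                words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}

The word-count logic here is the same code you would write for a batch job; Spark Streaming’s contribution is only the slicing of the stream into 1-second batches.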

Flink

As mentioned earlier, Spark Streaming slices the real-time data stream into time segments and processes them as small-batch data. Flink, by contrast, was designed from the start around stream processing: when data read from a file system (HDFS) is also treated as a data stream, Flink becomes a batch processing system.

Why can Flink support both stream processing and batch processing?

If you want to perform stream computing, Flink initializes a stream execution environment, StreamExecutionEnvironment, and then uses that environment to build the data stream, DataStream.

StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

If you want to perform batch computing, Flink initializes a batch execution environment, ExecutionEnvironment, and then uses that environment to build the data set, DataSet.

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> text = env.readTextFile("/path/to/file");

You then apply various data transformation operations (transformations) to the DataStream or DataSet, which is very similar to Spark. Whether for stream processing or batch processing, the Flink runtime’s execution engine is the same; only the data sources differ.
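For example, continuing from the DataSet built above, a batch word count might look like the following sketch (an illustrative example, not the article’s own code; the file path is the placeholder from above):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> text = env.readTextFile("/path/to/file");

        DataSet<Tuple2<String, Integer>> counts = text
                // Transformation: split each line into (word, 1) pairs.
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split("\\s+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                // Transformation: group by the word and sum the counts.
                .groupBy(0)
                .sum(1);

        counts.print(); // print() triggers execution of the batch job
    }
}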

Flink’s way of handling real-time data streams is also very similar to Spark Streaming’s in that the stream can be cut up and processed as small batches. However, stream processing is a “first-class citizen” in Flink, and Flink’s support for it is more complete: you can perform window operations on a data stream, splitting it into windows one by one and then computing over each window.

For example, executing the following on a data stream

.timeWindow(Time.seconds(10))

divides the data into 10-second time windows, and the batch of data in each window can then be further aggregated and summarized.
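Putting this together, a windowed word count over a stream might look like the sketch below (illustrative, not the article’s code; keyBy(0) and timeWindow follow the older Flink API used in the snippet above, and the socket source is an assumed placeholder):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split("\\s+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(0)                          // partition the stream by word
                .timeWindow(Time.seconds(10))      // cut the stream into 10-second windows
                .sum(1);                           // aggregate within each window

        counts.print();
        env.execute("windowed word count");
    }
}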

Flink’s architecture looks very similar to that of Hadoop 1 or YARN. The JobManager is the manager of a Flink cluster. After a Flink program is submitted to the JobManager, the JobManager checks the resource utilization of all TaskManagers in the cluster and, if there are idle TaskSlots, assigns computing tasks to them for execution.

Summary

When big data technology first appeared, it targeted only batch computing, that is, offline computing. Big data real-time computing could, relatively speaking, reuse the technical solutions the Internet industry already had for real-time online business processing. After all, for Google, the billions of user search requests arriving every day are also big data, and Internet applications already had a complete set of solutions for handling real-time, high-concurrency requests; the demand for dedicated big data stream computing was not strong at the time.

However, looking back at the history of computer software development, we find that it can be read as a history of the continuous separation of technology from business. People keep separating business logic from technical implementation, and the emergence of various technologies and architectural solutions has basically served this goal.

In the earliest days, we programmed in machine language and assembly language, implementing business logic directly with CPU instructions; computer software was a collection of CPU instructions. At that time, technology and business were completely coupled: programming was machine-oriented, business logic was expressed in machine instructions, and programmers had to think in terms of the machine and memorize its instruction set.

Later came operating systems and high-level programming languages, which separated software from CPU instructions. We programmed in high-level languages such as Pascal and COBOL and ran the programs on an operating system. We were no longer programming for the machine but for business logic and processes: an important separation of business logic from computer technology.

Then object-oriented programming languages appeared, a milestone in the history of programming. Our focus shifted from machines and business processes to the business objects themselves: analyzing the relationships and collaboration between business objects in the real world and mapping them into software. This was a revolution in programming thinking, separating business from technical implementation at the level of ideas.

Later still, various programming frameworks appeared. On the one hand, they made the separation of business and technology more complete: imagine what it would feel like to build a web application without them, writing your own code to listen on port 80 and parse the raw HTTP byte stream. On the other hand, these frameworks also decoupled complex business systems themselves: view, business, service, and storage modules at each layer are developed and deployed independently and integrated into one system through the framework.

Back to stream computing: although various distributed technologies could be used to implement real-time processing of large-scale data, what we really want is to write business logic as if for a small amount of data, throw it onto a large-scale server cluster, and have large-scale real-time data processed by stream computing. In other words, business implementation is separated from big data stream processing technology, and the business side need not care about the technology. This is why the various big data stream computing frameworks emerged as the times required.

In fact, Internet application development in general is moving toward the separation of business and technology. Cloud computing delivers distributed solutions to developers as cloud services, so that developers need not worry about deploying and maintaining distributed infrastructure. Currently popular approaches such as microservices, containers, service orchestration, and Serverless go a step further, letting developers focus solely on business development while business flow, resource scheduling, and service governance are handled by the platform. FaaS, fashionable in the Internet of Things field, means Function as a Service: developers write functions, submit them, and the functions are automatically deployed and run across the IoT cluster.

In short, stream computing unifies resource management and data flow for large-scale real-time computation. Developers only need to write data-processing logic as if for a small amount of data and deploy it on a stream computing platform, and large-scale data is then processed in a streaming fashion.