03 | Serialization: How are objects transmitted over the network?

The previous lecture explained how to design an extensible, backward-compatible protocol for the RPC framework. The key point is to make good use of the extension fields in the Header and in the Payload; it is these extension fields that make backward compatibility possible.

Following on from that key point, today I will explain serialization in the RPC framework. Choosing the right serialization method for each scenario is crucial to the overall stability and performance of the RPC framework.

Why is serialization needed?

First, we need to know what serialization and deserialization are.

Let’s first review the RPC principles introduced in [Lecture 01]. When describing the RPC communication process, we said:

The data transmitted over the network must be binary, but the request and response parameters on the caller’s side are objects. Objects cannot be transmitted over the network directly, so they must first be converted into transmittable binary data, and the conversion must be reversible. This process is generally called “serialization”. The service provider can then correctly split the different requests out of the binary data and, based on the request type and serialization type, restore each binary message body back into a request object. This process is called “deserialization”.

These two processes are shown in the figure below:

Serialization and deserialization

In summary, serialization is the process of converting objects into binary data, and deserialization is the process of converting binary data into objects.

So why does the RPC framework need serialization? Please recall the communication process of RPC:

RPC communication flow chart

Let me use an example to help you understand. Think of shipping a piece of self-assembled furniture by courier. Before shipping, the sender takes it apart and packs up the parts, which is like serialization; the courier then wraps the parcel in protective, standardized packaging so it won’t get damaged in transit, which is like encoding the serialized data and encapsulating it into a fixed-format protocol; a couple of days later, the recipient unpacks the parcel and reassembles the pieces, which is like protocol decoding and deserialization.

So is it clear now? Because the data transmitted over the network must be binary data, serialization and deserialization of input parameter objects and return value objects is a necessary process in RPC calls.

What are the commonly used serializations?

So, does this process sound simple? In reality it is actually quite involved. Let’s first take a look at the serialization methods that are commonly used; below I will briefly introduce several of them.

JDK native serialization

If you develop in Java, you are certainly familiar with the JDK’s native serialization. The following is an example of JDK serialization:

import java.io.*;

public class Student implements Serializable {
    //student ID
    private int no;
    //Name
    private String name;

    public int getNo() {
        return no;
    }

    public void setNo(int no) {
        this.no = no;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "Student{" +
                "no=" + no +
                ", name='" + name + '\'' +
                '}';
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        String home = System.getProperty("user.home");
        String basePath = home + "/Desktop";
        FileOutputStream fos = new FileOutputStream(basePath + "/student.dat");
        Student student = new Student();
        student.setNo(100);
        student.setName("TEST_STUDENT");
        ObjectOutputStream oos = new ObjectOutputStream(fos);
        oos.writeObject(student);
        oos.flush();
        oos.close();

        FileInputStream fis = new FileInputStream(basePath + "/student.dat");
        ObjectInputStream ois = new ObjectInputStream(fis);
        Student deStudent = (Student) ois.readObject();
        ois.close();

        System.out.println(deStudent);

    }
}

We can see that the serialization mechanism that comes with JDK is very simple for users. The specific implementation of serialization is completed by ObjectOutputStream, and the specific implementation of deserialization is completed by ObjectInputStream.

So how is the serialization process of JDK completed? Let’s take a look at the picture below:

ObjectOutputStream serialization process diagram

During serialization, special delimiters are continuously written along with the object data; during deserialization, these delimiters are used to split the stream back apart.

Header data is used to declare the serialization protocol and serialization version for backward compatibility between high and low versions.

The object data mainly includes the class name, signature, attribute names, attribute types, and attribute values, as well as start and end markers. Apart from the attribute values, which are the actual object data, everything else is metadata used for deserialization.

In the case of object references and inheritance, the write-object logic is applied recursively.
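If you want to see this header with your own eyes, a small sketch like the one below (reusing the student.dat file written in the example above) reads the first four bytes and compares them with the constants defined in java.io.ObjectStreamConstants:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.ObjectStreamConstants;

public class StreamHeaderCheck {
    public static void main(String[] args) throws Exception {
        String path = System.getProperty("user.home") + "/Desktop/student.dat";
        // Read the first four bytes of the file written by ObjectOutputStream above
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            short magic = in.readShort();    // expected 0xACED (STREAM_MAGIC)
            short version = in.readShort();  // expected 5 (STREAM_VERSION)
            System.out.printf("magic=0x%04X, version=%d%n", magic & 0xFFFF, version);
            System.out.println(magic == ObjectStreamConstants.STREAM_MAGIC
                    && version == ObjectStreamConstants.STREAM_VERSION);
        }
    }
}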

In fact, the core idea of any serialization framework is to design a serialization protocol that writes the object type, attribute types, and attribute values into a binary byte stream in a fixed format; during deserialization, the object type, attribute types, and attribute values are read back out in the same fixed format and used to recreate a new object.
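To make this core idea concrete, here is a deliberately simplified toy sketch (not how any real framework is implemented) that writes the Student object’s class name and attribute values into a byte stream in a fixed order and then reads them back in the same order; it assumes the Student class above and java.io.*:

// A toy fixed-format "protocol": [class name][no][name], nothing more.
// Real frameworks also encode attribute types, versions, object references, and so on.
Student student = new Student();
student.setNo(100);
student.setName("TOY");

ByteArrayOutputStream bos = new ByteArrayOutputStream();
DataOutputStream out = new DataOutputStream(bos);
out.writeUTF(Student.class.getName()); // object type
out.writeInt(student.getNo());         // attribute value: no
out.writeUTF(student.getName());       // attribute value: name
out.flush();
byte[] bytes = bos.toByteArray();

// Deserialization: read everything back in exactly the same fixed order and rebuild the object
DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
String className = in.readUTF();       // could be used to decide which class to instantiate
Student copy = new Student();
copy.setNo(in.readInt());
copy.setName(in.readUTF());
System.out.println(className + " / no=" + copy.getNo() + " / name=" + copy.getName());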

JSON

JSON is probably the serialization format we are most familiar with. It is a typical key-value format, carries no type information, and is a text-based serialization framework. There is plenty of material about JSON’s format and characteristics online, so it will not be covered here.

Its applications are very broad: whether for Ajax calls from the front-end web, text data stored on disk, or RPC communication over HTTP, JSON is often the format of choice.

However, there are two problems with JSON serialization that require special attention:

The additional space overhead of JSON serialization is relatively large, which for services with large data volumes means huge memory and disk overhead;

JSON carries no type information, so strongly typed languages like Java have to rely on reflection to reconstruct objects, which hurts performance.

Therefore, if the RPC framework uses JSON serialization, the amount of data transmitted between the service provider and the service caller must be relatively small, otherwise performance will be seriously affected.
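For comparison with the examples below, here is a minimal JSON sketch. It assumes the Jackson library (com.fasterxml.jackson.databind.ObjectMapper) is on the classpath and reuses the Student class above; any other JSON library would look similar.

ObjectMapper mapper = new ObjectMapper();

Student student = new Student();
student.setNo(102);
student.setName("JSON");

// Serialize the object into JSON text: human-readable, but field names are repeated in every payload
String json = mapper.writeValueAsString(student);  // {"no":102,"name":"JSON"}

// Deserialize: the library rebuilds the strongly typed Java object via reflection
Student deStudent = mapper.readValue(json, Student.class);
System.out.println(deStudent);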

Hessian

Hessian is a dynamically typed, binary, compact serialization framework that is portable across languages. The Hessian protocol is more compact than JDK and JSON serialization, its performance is considerably higher, and the byte output it produces is smaller.

The usage code example is as follows:

// Requires com.caucho.hessian.io.Hessian2Output / Hessian2Input (Hessian library) and java.io.*
Student student = new Student();
student.setNo(101);
student.setName("HESSIAN");

//Convert the student object into a byte array
ByteArrayOutputStream bos = new ByteArrayOutputStream();
Hessian2Output output = new Hessian2Output(bos);
output.writeObject(student);
output.flushBuffer();
byte[] data = bos.toByteArray();
bos.close();

//Convert the byte array just serialized into a student object
ByteArrayInputStream bis = new ByteArrayInputStream(data);
Hessian2Input input = new Hessian2Input(bis);
Student deStudent = (Student) input.readObject();
input.close();

System.out.println(deStudent);

Compared with JDK and JSON, Hessian is more efficient, generates smaller bytes, and has very good compatibility and stability. Therefore, Hessian is more suitable as a serialization protocol for remote communication in the RPC framework.

But Hessian itself also has problems. The official version does not support some common object types in Java, such as:

The Linked family of collections, LinkedHashMap, LinkedHashSet, etc. (this can be worked around by extending the CollectionDeserializer class);

the Locale class (this can be worked around by extending the ContextSerializerFactory class);

Byte/Short are turned into Integer when deserialized.

The above situations require special attention in practice.

Protobuf

Protobuf is Google’s internal mixed-language data standard. It is a lightweight and efficient structured data storage format that can be used for structured data serialization and supports Java, Python, C++, Go and other languages. When using Protobuf, you need to define IDL (Interface description language), and then use IDL compilers in different languages to generate serialization tool classes. Its advantages are:

The volume after serialization is much smaller than JSON and Hessian;

The IDL describes the semantics clearly, which is enough to guarantee that type information is not lost between applications, without needing anything like an XML parser;

Serialization and deserialization are very fast, and there is no need to obtain types through reflection;

Message format upgrades and compatibility are handled well: the format can evolve in a backward-compatible way.

The usage code example is as follows:

/**
 *
 * // IDl file format
 * syntax = "proto3";
 * option java_package = "com.test";
 * option java_outer_classname = "StudentProtobuf";
 *
 * message StudentMsg {
 * //serial number
 * int32 no = 1;
 * //Name
 * string name = 2;
 * }
 *
 */
 
// StudentProtobuf.StudentMsg is the Java class generated by the protoc compiler from the IDL above
StudentProtobuf.StudentMsg.Builder builder = StudentProtobuf.StudentMsg.newBuilder();
builder.setNo(103);
builder.setName("protobuf");

//Convert the student object into a byte array
StudentProtobuf.StudentMsg msg = builder.build();
byte[] data = msg.toByteArray();

//Convert the byte array just serialized into a student object
StudentProtobuf.StudentMsg deStudent = StudentProtobuf.StudentMsg.parseFrom(data);

System.out.println(deStudent);

Protobuf is very efficient, but for languages with reflection and dynamic capabilities it is more laborious to use than Hessian. With Java, for example, this pre-compilation step is not strictly necessary, and you can consider using Protostuff instead.

Protostuff does not rely on IDL files and can serialize and deserialize Java domain objects directly, with efficiency close to Protobuf’s; the binary format it produces is exactly the same as Protobuf’s, so it can be regarded as a Java-oriented Protobuf serialization framework. However, in practice I ran into some cases it does not support (a usage sketch follows after this list):

null is not supported;

ProtoStuff does not support simple Map and List collection objects, they need to be wrapped in objects.
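For reference, a minimal Protostuff sketch might look like the following; it assumes the protostuff-core and protostuff-runtime dependencies (io.protostuff.ProtostuffIOUtil, io.protostuff.LinkedBuffer, io.protostuff.runtime.RuntimeSchema) and reuses the Student class above.

// Protostuff derives a schema from the Java class at runtime, so no IDL file or code generation is needed
Schema<Student> schema = RuntimeSchema.getSchema(Student.class);

Student student = new Student();
student.setNo(104);
student.setName("protostuff");

// Serialize; the resulting binary is Protobuf-compatible
LinkedBuffer buffer = LinkedBuffer.allocate(LinkedBuffer.DEFAULT_BUFFER_SIZE);
byte[] data;
try {
    data = ProtostuffIOUtil.toByteArray(student, schema, buffer);
} finally {
    buffer.clear();
}

// Deserialize into a freshly created instance
Student deStudent = schema.newMessage();
ProtostuffIOUtil.mergeFrom(data, deStudent, schema);
System.out.println(deStudent);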

How to choose serialization in RPC framework?

I have just briefly introduced several of the most common serialization protocols. In fact, there are far more than these, including MessagePack, Kryo, and others. So, faced with so many serialization protocols, how should we choose one for the RPC framework?

The first thing that probably comes to mind is performance and efficiency. Yes, that is indeed a factor worth considering. As mentioned just now, serialization and deserialization are a necessary part of every RPC call, so their performance and efficiency are bound to be directly tied to the overall performance and efficiency of the RPC framework.

Apart from this, what else came to mind?

Yes, there is also space overhead, which is the size of the binary data after serialization. The smaller the size of the serialized byte data, the smaller the amount of data transmitted over the network, and the faster the data transmission speed. Since RPC is a remote call, the speed of network transmission will be directly related to the time it takes to respond to the request.

Now please think again, what other factors can affect our choice?

That’s right: the versatility and compatibility of the serialization protocol. In day-to-day RPC operations, serialization issues are probably the problems encountered and answered most often, and business teams report them all the time. For example: the caller of some service cannot parse a collection class used as an input parameter; after the service provider adds an attribute to a parameter class, the caller can no longer make calls normally; after upgrading the RPC version, a serialization exception is thrown during a call…

When choosing a serialization method, the protocol’s versatility and compatibility take a higher priority than its efficiency, performance, and serialized size, because they are directly tied to the stability of service calls, and the reliability of a service is clearly more important than its raw performance. What we care about first is whether the protocol stays compatible after version upgrades, whether it supports enough object types, whether it is cross-platform and cross-language, and whether it has been widely used and its pitfalls are well known; only after that do we consider performance, efficiency, and space overhead.

There is one more point that should be emphasized. Besides versatility and compatibility, the security of the serialization protocol is also a very important reference factor, and should arguably even be considered first. Take JDK native serialization as an example: it has known deserialization vulnerabilities, and if the serialization layer has security holes, online services can easily be compromised.

Based on the above reference factors, let us now summarize these serialization protocols.

Our first choice is Hessian and Protobuf, because they meet our requirements in terms of performance, time overhead, space overhead, versatility, compatibility and security. Among them, Hessian is more convenient to use and has better object compatibility; Protobuf is more efficient and has more advantages in versatility.
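If you want a hands-on feel for the space overhead mentioned above, a rough sketch like this (assuming Jackson and Hessian are on the classpath and reusing the Student class from earlier) prints the serialized size of the same object under three formats; the exact numbers depend on the object being serialized.

Student student = new Student();
student.setNo(105);
student.setName("SIZE_TEST");

// JDK native serialization
ByteArrayOutputStream jdkBos = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(jdkBos);
oos.writeObject(student);
oos.close();

// JSON via Jackson
byte[] jsonBytes = new ObjectMapper().writeValueAsBytes(student);

// Hessian
ByteArrayOutputStream hessianBos = new ByteArrayOutputStream();
Hessian2Output output = new Hessian2Output(hessianBos);
output.writeObject(student);
output.flushBuffer();

System.out.println("JDK: " + jdkBos.size() + " bytes, JSON: " + jsonBytes.length
        + " bytes, Hessian: " + hessianBos.size() + " bytes");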

What issues should we pay attention to when using the RPC framework?

Now that we understand how to choose serialization in the RPC framework, what serialization issues do we need to pay attention to during use?

As I just said, serialization problems were the ones I encountered most often in RPC operations. Apart from problems in early versions of the RPC framework itself, most of them were caused by incorrect usage on the user’s side. Next, let’s go through the human-made problems that occur most frequently.

The object structure is too complex: there are many attributes and multiple levels of nesting. For example, object A references object B, object B aggregates object C, object C aggregates many other objects, and the dependency graph becomes overly complex. The more complex the object, the more performance and CPU the serialization framework wastes when serializing and deserializing it, which seriously affects the overall performance of the RPC framework; in addition, the more complex the object, the longer serialization and deserialization take, and the higher the probability of problems.

The object is too large: I am often asked by business teams why their RPC requests frequently time out, and investigation shows that their input parameter objects are very large, such as a huge List or a huge Map whose serialized byte length reaches several megabytes. This also seriously wastes performance and CPU, and serializing such a large object is very time-consuming, which directly affects the request time.

Using classes that the serialization framework does not support as input parameter classes: for example, the Hessian framework does not natively support LinkedHashMap, LinkedHashSet, and the like, and in most cases it is best not to use third-party collection classes, such as the collection classes in Guava, because many open source serialization frameworks prioritize the programming language’s native types. Therefore, if an input parameter is a collection, try to use the native, most commonly used collection classes, such as HashMap and ArrayList.

Objects have complex inheritance relationships: most serialization frameworks serialize an object’s properties one by one, and when there is an inheritance relationship they keep looking up the parent classes and traversing their attributes. Just like the first problem, the more complex the object relationships are, the more performance is wasted and the more likely serialization problems become.

In the process of using the RPC framework, we should try to build simple objects as input parameters and return value objects to avoid the above problems.
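As a rule of thumb, an input parameter class is ideally as flat as the hypothetical example below: only simple fields and native, commonly used collection types, with no deep nesting and no inheritance.

import java.io.Serializable;
import java.util.List;

// A deliberately flat, hypothetical request object: simple fields plus a native collection type,
// no nesting beyond one level, no parent class
public class QueryStudentRequest implements Serializable {
    private static final long serialVersionUID = 1L;

    private int classNo;              // simple field
    private String keyword;           // simple field
    private List<Integer> studentNos; // native collection (e.g. an ArrayList at runtime)

    // getters and setters omitted for brevity
}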

Summary

Today we learned in depth what serialization is and introduced several common serialization methods such as JDK native serialization, JSON, Hessian, and Protobuf.

Beyond that basic knowledge, we focused on how to choose a serialization protocol in the RPC framework. There are several very important reference factors; in order of priority from high to low they are security, then versatility and compatibility, and after that the performance, efficiency, and space overhead of the serialization framework.

Ultimately, this is because the stability and reliability of service calls matter more than service performance and response time. In addition, for an RPC call, most of the time and performance cost of the overall invocation lies in the service provider executing its business logic, so the serialization overhead has a relatively small impact on the overall cost.

When constructing input parameter and return value objects for the RPC framework, remember mainly the following points:

1. The object should be as simple as possible, without too many dependencies, not with too many attributes, and as high cohesion as possible;

2. The size of the input parameter object and return value object should not be too large, let alone a collection that is too large;

3. Try to use simple, commonly used objects that are native to the development language, especially collection classes;

4. Objects should not have complex inheritance relationships, and it is best not to have parent-child classes.

In fact, although the RPC framework lets us make remote calls as if they were local calls, the fundamental role of input parameters and return values during transmission in the RPC framework is simply to carry information. To improve the overall performance and stability of RPC calls, it is important to construct our input parameter and return value objects as simply as possible.