07｜Integrating data: How are enumerations, structures and unions implemented?

The C language provides a level of abstraction above machine instructions, which lets us build applications in a form closer to natural language. If programming in C is like building a house out of bricks, then programming in a more abstract, coarser-grained language is like building it out of prefabricated walls. From this perspective C is less convenient than other high-level languages, but it also offers a finer construction granularity, letting us shape each wall according to our own ideas.

The bricks and walls mentioned here can be understood simply as the data types a programming language offers for building programs. In Python, for example, we can use complex built-in types such as set and dict. In Java, Map is further subdivided into HashMap, LinkedHashMap, EnumMap and other types for different application scenarios.

To stay simple while remaining flexible enough, the C language provides three further kinds of type on top of its basic numeric and pointer types: structures (struct), unions (union) and enumerations (enum). Combining these types lets us splice small “bricks” into larger, more complex building units with a specific functional structure.

Next, let’s take a look: How does the compiler implement these three data types behind the scenes? In terms of implementation, in order to take into account the performance requirements of the program, what special optimizations has the compiler made?

Enumeration

In programming languages, an enumeration is a data type that programmers can define to represent an abstract concept with a limited range of possible values.

Let’s look at a classic example: How should we use a programming language to represent the concept of a “working day”?

The working day is an abstract concept in the real world. It has five valid values, Monday to Friday. Unlike concepts such as numbers and characters, it does not correspond directly to anything in the software or hardware of a physical computer. Therefore, to express this kind of information more accurately in a program, you can use an enumeration to define a matching type.

In C language, it can be implemented like this:

For ease of observation, the C code and its corresponding assembly code are shown side by side. As you can see, the compiler generates no machine instructions for the enumeration type definition in the red box on the left. In C, every enumeration constant has type int, which is why enumeration values are sometimes called “named integers”. The assembly code in the blue box on the right shows that when function foo is called, the enumeration value Mon is passed as the literal 0 through the edi register. In other words, Mon is represented under the hood by the number 0.

Similarly, in line 11 of the C code on the left, a generic macro is also used to determine the specific type of the enumeration value Mon. You can try running this code and observe the output of the program to verify our conclusion.

Note that the C standard treats enumeration values directly as integers, which can lead to unexpected problems when building programs. In the C code above, for example, function foo may actually be called with any value that can be implicitly converted to int, even a value held in a variable of another enumeration type. Making sure that enumeration types help organize the code without being misused is therefore something we need to watch for when building high-quality programs.

Structure

In C, an array stores a group of values of the same type in a contiguous memory segment. A structure (struct) is similar, except that inside a structure we can store data of different types. Let’s look at a piece of code first:

In the C code on the left side of the image above, we define a structure called S. Each object of structure S stores three values of completely different types one after another: a character pointer, a character value, and a long integer value.

In line 10 of the code, we construct an object s of structure S with a brace-enclosed initializer list. The assembly code in the blue box on the upper right shows how the compiler initializes it. Essentially, a structure is just an encapsulation of the data it contains. From the perspective of the compiled product, it only needs to store that data continuously in memory. This is indeed the case: the three members of structure S are each initialized by a mov instruction, and they are initialized in stack memory.

There is no doubt that the data items in the structure are initialized in memory, but are they really “contiguous”?

To verify this, we print the size of structure S with the sizeof operator on line 12 of the C code on the left. Given the way structure S is defined and our understanding of the word “contiguous”, its size should be 17 bytes on the x86-64 platform: 8 bytes for the character pointer, 1 byte for the character, and 8 bytes for the long integer value. But the assembly code in the yellow box on the right shows that this is not the case: each object of structure S actually occupies 24 bytes of memory. So why is this?

By sorting out the assembly code used to initialize object s, we can work out the actual layout of its member fields in stack memory, which gives the following picture:

From left to right, this picture represents the growth direction of stack memory (high address -> low address). Among them, the register rsp points to the low address at the top of the stack, and the rbp register points to the high address at the beginning of the stack frame. According to the instructions in the assembly code, the character pointer p is located at [rbp-32] and occupies 8 bytes; the character c is located at [rbp-24] and occupies 1 byte. The long integer variable x is located at [rbp-16] and occupies 8 bytes.

It can be seen that the compiler does not actually place these three values strictly back to back. The 7 bytes from [rbp-23] to [rbp-17], between the character c and the long integer x, do not store any data. An important reason for the compiler to do this is “data alignment”.

Memory data alignment

On modern computers, the CPU can usually operate on data most efficiently when the address of the data being read or written satisfies “natural alignment”. Natural alignment means that the address of the data is an integer multiple of the data’s size. For example, on the x86-64 architecture, if the value of an int variable is stored contiguously in memory and the address of its least significant byte (LSB) is an integer multiple of 4, we say that the value of the variable is aligned in memory.

Why does natural alignment maximize the CPU’s memory access efficiency? The answer lies in many constraints that arose during the development of the core hardware involved in memory access, such as the CPU and the MMU (memory management unit). For example, some older Sun SPARC and ARM processors can only access aligned data and raise an exception on an unaligned access. Other processors do support unaligned access but, because of design and process limitations, need more clock cycles to complete it.

Therefore, to make code suit the “style” of different processors, ensuring that data in memory meets the natural alignment requirements has become a default consensus among most compilers when generating machine instructions. This remains true even though, on today’s x86-64 processors, the performance penalty of accessing unaligned data is negligible in most cases.

Padding bytes

Let’s go back to the previous example. To ensure that every member field of object s meets the natural alignment requirements in stack memory, the compiler inserts additional “padding bytes” that adjust the starting position of the data for each field in the structure object.

In addition, padding bytes may be added even when every individual data member in the structure already meets the natural alignment requirements. Consider the following example:

struct Foo {
  char *p; // 8 bytes.
  char c; // 1 byte.
  // (padding): 7 bytes.
};

Here the two member fields of structure Foo already meet the natural alignment requirements by default (assuming that the character pointer p starts at an 8-byte-aligned address). Yet when we evaluate the structure with the sizeof operator, the result is 16 bytes rather than the intuitive 9 bytes.

This happens because the compiler wants to ensure that when structure objects are stored contiguously (for example, in an array), the end of one object falls exactly where the next object can start while still satisfying the natural alignment requirements. This in turn requires the size of the structure to be an integer multiple of the alignment of its most strictly aligned member, which for basic types is usually the size of its largest member. The compiler therefore pads the appropriate number of bytes after the last member to satisfy this condition. A structure laid out this way meets the natural alignment conditions in every scenario, and this padded size is what the sizeof operator reports.

Union

Finally, let’s take a look at the third powerful data type in C, the “union”. Syntactically a union is used much like a structure, except that the keyword struct is replaced by union. There is, however, one big difference between the two, and the name gives it away: “union” means that all data fields defined in the union are united and share the same memory area. Let’s look at a piece of code first:

Here, in the C code on the left, the union is wrapped using the “Tagged Union” pattern. Unlike with a structure, we have no way of knowing from a union object alone which of its fields is in effect at a given moment. A Tagged Union therefore gives each union a separate “tag” that states explicitly which field is currently in effect. To do so, we wrap the tag and the union together so that they are “bound” to each other.

As you can see, here inside the structure S, the enumeration type field type is used to mark the type of data stored in the current anonymous union. Within the following anonymous union, the integer member i and the character member c share the memory space of the union. This is the basic use of Tagged Union in C language.

The size of a union object is determined by the largest member in the union’s definition: it must be at least that large, rounded up if necessary to satisfy alignment. In the above example, the size of the anonymous union in structure S is therefore the same as the size of its integer member i, which is 4 bytes on x86-64 platforms.

The assembly code in the blue box on the right of the picture leads to the same conclusion. The first instruction zeroes the entire 8 bytes occupied by the structure object s, in preparation for the subsequent assignment to the anonymous union; the second stores the value 1, which corresponds to the enumeration value CHAR, into the enumeration field type of s; the third stores the value 97, which corresponds to the character “a”, into the anonymous union inside s. Note that this mov instruction uses BYTE in its destination operand: it “takes out” 1 byte of the 4-byte space occupied by the union and uses it as the target memory for the character value.

Summary

This lecture mainly introduces the three data types of enumeration, structure and union in C language, and explores their specific implementation at the machine instruction level.

An enumeration is a data type used to represent abstract entities with a limited range of possible values. The values of an enumeration type are also called “named integers”, so they can be used directly as integer values in C code. Likewise, in compiler-generated code, enumeration values are replaced directly by their corresponding integers. Note, however, that because C converts freely between the two, we must take care not to pass arbitrary integers, or values of another enumeration type, where a particular enumeration is expected.

A structure is a composite data type used to organize heterogeneous data. In a structure, all defined data fields are arranged sequentially in memory. In order to ensure the most efficient data access speed for each field in the structure, the compiler will ensure that their starting addresses meet the natural alignment standards when laying out these field data in memory. Therefore, the different definition order of the fields in the structure will directly affect the actual memory footprint of the structure object, and this is also an important entry point for us to optimize our program.

A union is a special composite data type in which all defined data fields occupy the same memory space. The actual size of a union object is determined by the largest field defined within it. By default, there is no way to tell from the outside which field of a union object is “in effect”, so the Tagged Union pattern has become mainstream. By “packaging” an enumeration that identifies the valid field together with the union, we can check the tag before using the union object, which lays a foundation for the robustness of the application.
