The relationship between various encoding formats (GB2312, GBK, GB18030, unicode, utf-8)

Common encoding formats for Chinese characters

To display characters on the screen, the following steps are required:

1. Make fonts (glyphs) for all characters, that is, decide what each character looks like. For example, the glyph of the capital letter A is the final graphic drawn on the screen, which is the character A we see.
2. Assign a code to every character. For example, the code of the uppercase letter A is 0x41.
3. Specify a storage format. Since there are far more characters than one byte can represent, a format is needed for storing and transmitting character codes, so that a reader can tell whether a character occupies one byte or several. For example, does 0xC0 0xEE represent one character (0xC0EE) or two characters (0xC0, 0xEE)?

The first step produces our common fonts, such as 宋体 and 楷体; it is concerned only with what the characters look like. Some standards handle the second and third steps together, such as the GB2312 and GB18030 encoding methods. Others separate them: for example, the Unicode encoding covers only the second step, and the third step is handed over to utf-8, utf-16, utf-32, etc.

To display characters on the screen, the computer loads the stored text, parses the character codes according to the encoding format, looks up the glyph for each character code in the font, and then draws it. Because computers developed iteratively, and earlier encoding methods could not meet later needs, many encoding methods have emerged, such as the common GB2312 and utf-8. The following figure shows the relationship between the commonly used encoding methods:

GB18030, GBK, GB2312 Chinese character encoding format

GB2312 encoding

GB2312 is the first national standard for Chinese character encoding. It was released by the State Administration of Standards of China in 1980 and came into use on May 1, 1981. GB2312 contains 6763 Chinese characters: 3755 first-level characters and 3008 second-level characters. It also includes 682 full-width characters, covering Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters. It uses a 2-byte encoding with the range A1A1~FEFE. The first byte is called the area (区) and the second byte is called the position (位). There are 94 areas (first byte A1~FE), and each area contains 94 positions (second byte A1~FE), for a total of 94*94=8836 code points.

Areas 1-9 contain the 682 non-Chinese characters (symbols and letters).
Areas 10-15 are blank areas and are not used.
Areas 16-55 contain 3755 first-level Chinese characters, sorted by pinyin.
Areas 56-87 contain 3008 second-level Chinese characters, sorted by radical/stroke.
Areas 88-94 are blank areas and are not used.
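
The area layout above maps directly onto bytes: adding 0xA0 to the area number gives the first byte, and adding 0xA0 to the position within the area gives the second byte. The following C sketch illustrates this. The function names gb2312_from_area_pos and gb2312_area_kind are hypothetical helpers written for illustration, not part of any library.

```c
#include <stdint.h>

/* Convert a GB2312 area/position pair (each 1..94) into its two-byte code.
 * Returns 0 on success, -1 if either number is out of range.
 * Hypothetical helper, written only to illustrate the layout. */
int gb2312_from_area_pos(int area, int pos, uint8_t out[2])
{
    if (area < 1 || area > 94 || pos < 1 || pos > 94) {
        return -1;
    }
    out[0] = (uint8_t)(0xA0 + area); /* area 1..94     -> 0xA1..0xFE */
    out[1] = (uint8_t)(0xA0 + pos);  /* position 1..94 -> 0xA1..0xFE */
    return 0;
}

/* Classify an area number according to the list of areas above. */
const char *gb2312_area_kind(int area)
{
    if (area >= 1 && area <= 9)   return "symbols";
    if (area >= 16 && area <= 55) return "level-1 hanzi";
    if (area >= 56 && area <= 87) return "level-2 hanzi";
    if ((area >= 10 && area <= 15) || (area >= 88 && area <= 94)) {
        return "unused";
    }
    return "invalid";
}
```

For example, area 16, position 1 (the character 啊, the first of the first-level Chinese characters) yields the bytes 0xB0 0xA1.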

GBK encoding

GB2312 basically meets the needs of computer processing of Chinese characters, but it supports too few of them: it cannot handle rare characters found in personal names, classical Chinese, and so on. This led to the later GBK and GB18030 character sets. GBK 1.0, the Chinese character extension specification of 1995, is compatible with GB2312. The K in GBK is the initial of kuo (扩) in 扩展, which means extension; the full English name is Chinese Internal Code Specification. GBK uses a double-byte representation with an overall range of 8140~FEFE: the first byte is between 81 and FE, and the second byte is between 40 and FE, excluding 7F. This gives 126*190=23940 code points, of which 21886 are assigned to Chinese characters and graphic symbols, divided into a Chinese character area (21003 characters) and a graphic symbol area.
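
The 23940 figure can be checked by brute force. The C sketch below tests every byte pair against the GBK ranges given above (first byte 81~FE, second byte 40~FE without 7F). The function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Return 1 if (lead, trail) lies inside the GBK double-byte range:
 * lead 0x81..0xFE, trail 0x40..0xFE excluding 0x7F. Otherwise return 0.
 * This checks the range only; it does not say whether the code point
 * is actually assigned to a character. */
int gbk_is_double_byte(uint8_t lead, uint8_t trail)
{
    if (lead < 0x81 || lead > 0xFE) {
        return 0;
    }
    if (trail < 0x40 || trail > 0xFE || trail == 0x7F) {
        return 0;
    }
    return 1;
}

/* Count all byte pairs that fall inside the GBK double-byte range. */
int gbk_count_code_points(void)
{
    int count = 0;
    for (int lead = 0; lead <= 0xFF; lead++) {
        for (int trail = 0; trail <= 0xFF; trail++) {
            count += gbk_is_double_byte((uint8_t)lead, (uint8_t)trail);
        }
    }
    return count;
}
```

Calling gbk_count_code_points() returns 23940, which matches 126 possible first bytes times 190 possible second bytes.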

GB18030 encoding

In 2000, GB18030 replaced GBK 1.0 and became the official national standard. Its main feature is that GB18030-2000 adds the Chinese characters of CJK Unified Ideographs Extension A on top of GBK. GB18030-2005 later added the Chinese characters of CJK Unified Ideographs Extension B.

GB18030 has three encoding lengths: single-byte, double-byte, and four-byte. The single-byte and double-byte encodings are fully compatible with GBK. In GB18030-2000, the four-byte encoding holds the 6582 Chinese characters of CJK Extension A.
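
A decoder can tell the three lengths apart by looking at the first two bytes. The C sketch below shows one way to do this. The four-byte ranges it uses (second and fourth bytes 0x30~0x39, first and third bytes 0x81~0xFE) come from the GB18030 standard and are not stated in this article; the function name gb18030_seq_len is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the length (1, 2 or 4) of the GB18030 sequence that starts at p,
 * or 0 if the bytes are not a valid sequence or the buffer is too short.
 * avail is the number of readable bytes at p. Only byte ranges are checked;
 * whether the code point is assigned is not. */
size_t gb18030_seq_len(const uint8_t *p, size_t avail)
{
    if (avail == 0) {
        return 0;
    }
    if (p[0] <= 0x7F) {
        return 1;                        /* single byte: ASCII */
    }
    if (p[0] < 0x81 || p[0] > 0xFE || avail < 2) {
        return 0;
    }
    if (p[1] >= 0x40 && p[1] <= 0xFE && p[1] != 0x7F) {
        return 2;                        /* double byte: same range as GBK */
    }
    if (p[1] >= 0x30 && p[1] <= 0x39) {  /* digit in 2nd byte: four-byte form */
        if (avail >= 4 &&
            p[2] >= 0x81 && p[2] <= 0xFE &&
            p[3] >= 0x30 && p[3] <= 0x39) {
            return 4;
        }
    }
    return 0;
}
```

The second byte is what separates the two multi-byte forms: 0x30~0x39 never appears as the second byte of a GBK pair, so it can safely signal a four-byte sequence.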

unicode encoding and format

Unicode official website: https://www.unicode.org

The GBxxx standards described above were all defined by China, and they mainly cover Chinese characters and ASCII. Unicode emerged to solve character encoding globally, meeting the needs of cross-language and cross-platform text conversion and processing. With it, we can write “I love you” in n languages in a single article.

unicode and UCS-2, UCS-4

While Unicode was compiling a universal character set in its early days, ISO was doing the same thing. ISO launched the ISO/IEC 10646 project, called the “Universal Multiple-Octet Coded Character Set”, abbreviated UCS. The two efforts were later merged, and by Unicode 2.0 the Unicode encoding and the UCS encoding were basically the same.
UCS-2 encodes each character in two bytes (16 bits), while UCS-4 uses 4 bytes (only 31 bits are actually used; the highest bit must be 0).
UCS-4 is divided into 2^7=128 groups according to the highest byte, whose highest bit is 0. Each group is divided into 256 planes according to the second-highest byte, each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte.
Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP for short. Characters in the BMP need only two bytes, with code points from U+0000 to U+FFFF. This is the entire coding range of UCS-2, which was later extended to UCS-4 because it ran out of code positions.
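
The group/plane/row/cell division is simply the four bytes of the UCS-4 value, from highest to lowest. A minimal C sketch follows; the type Ucs4Parts and the functions ucs4_split and ucs4_is_bmp are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* The four components of a UCS-4 code, one per byte. */
typedef struct {
    uint8_t group; /* highest byte, 0..127 because the top bit is 0 */
    uint8_t plane; /* second-highest byte */
    uint8_t row;   /* third byte */
    uint8_t cell;  /* lowest byte */
} Ucs4Parts;

/* Split a UCS-4 code into group, plane, row and cell.
 * Returns 0 on success, -1 if the highest bit is set (not a valid code). */
int ucs4_split(uint32_t code, Ucs4Parts *out)
{
    if (code & 0x80000000u) {
        return -1;
    }
    out->group = (uint8_t)(code >> 24);
    out->plane = (uint8_t)(code >> 16);
    out->row   = (uint8_t)(code >> 8);
    out->cell  = (uint8_t)code;
    return 0;
}

/* Return 1 if the code lies in the BMP (group 0, plane 0), else 0. */
int ucs4_is_bmp(uint32_t code)
{
    return code <= 0xFFFFu;
}
```

For example, 0x000090ED splits into group 0, plane 0, row 0x90, cell 0xED, so it is in the BMP, while 0x00020000 is in plane 2.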

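Unicode itself uses only the first 17 planes (U+0000 to U+10FFFF). Of these, characters are mainly assigned on planes 0, 1, 2 and 14: Chinese characters are on planes 0 and 2, and other characters are on planes 0, 1 and 14. More recent Unicode versions also place CJK extensions on plane 3.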
For the specific Unicode encoding range of Chinese characters, please see this link: https://www.qqxiuzi.cn/zh/hanzi-unicode-bianma.php.

utf-8, utf-16, utf-32 encoding formats

To explain the relationship between Unicode and utf-8, utf-16, utf-32, we must first make one thing clear: encoding format = encoding + storage format. The encoding corresponds to the second step mentioned at the beginning of the article, and the storage format corresponds to the third step. Unicode is only an encoding, not a storage format, while utf-x is an encoding format built on the Unicode encoding. This also explains the relationship between utf-8, utf-16, and utf-32: they all use the Unicode encoding and differ only in storage format, which is why there are three names.

This raises a question: what are the encoding and storage format of GB2312, GBK, and so on? In fact, they integrate the two. Because GB2312 codes run from A1A1 to FEFE, they do not conflict with ASCII (0x00~0x7F) in the byte stream. A byte less than 0x80 must be an ASCII code. A byte with its highest bit set (0x80 or above; A1~FE in GB2312) combines with the next byte to represent one Chinese character.

Unicode is different. Any byte of a Unicode character code can be smaller than 0x80, so there is no way to tell whether such a byte is an ASCII character or part of another character (such as a Chinese character). A separate storage format is therefore required. These encoding formats are discussed below.
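
The rule for GB2312 byte streams can be sketched in C as a scanner that walks the bytes and decides, byte by byte, whether it is looking at ASCII or at the start of a two-byte character. The function name gb2312_count_chars is hypothetical, and the sketch assumes strict GB2312 (both bytes of a Chinese character in A1~FE).

```c
#include <stddef.h>
#include <stdint.h>

/* Count the characters in a GB2312 byte stream of len bytes.
 * Returns the number of characters, or -1 if a double-byte character
 * is truncated or uses a byte outside 0xA1..0xFE. */
long gb2312_count_chars(const uint8_t *s, size_t len)
{
    long count = 0;
    size_t i = 0;

    while (i < len) {
        if (s[i] < 0x80) {
            i += 1;                          /* ASCII: one byte */
        } else {
            if (s[i] < 0xA1 || s[i] > 0xFE) {
                return -1;                   /* not a GB2312 first byte */
            }
            if (i + 1 >= len || s[i + 1] < 0xA1 || s[i + 1] > 0xFE) {
                return -1;                   /* missing or bad second byte */
            }
            i += 2;                          /* one two-byte character */
        }
        count++;
    }
    return count;
}
```

With the input bytes 0x41 0xC0 0xEE 0x42, the scanner reports three characters: the pair 0xC0 0xEE is read as one character because its first byte is not below 0x80, which answers the question raised at the beginning of the article.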

utf-8 encoding format

utf-8 reference website: http://www.utf-8.com/

utf-8 (Unicode Transformation Format 8-bit) is a variable-width encoding format that uses 1~4 bytes to represent one Unicode character code. It can represent every character in the Unicode character set.
The utf-8 encoding format:

It is a variable-length (1~4 bytes) encoding method.
One-byte format: bit[7] of the byte must be 0, and the 7 bits bit[6:0] carry the code, that is, 0b0xxxxxxx in binary.
n-byte format (n>1, that is, 2, 3 or 4 bytes):
1. In the first byte, bit[7:8-n] are all 1 and bit[8-n-1] is 0, which indicates the number of bytes. For example, a first byte of 0b110xxxxx indicates the two-byte format, and 0b1110xxxx indicates the three-byte format. The remaining bits carry the code.
2. In every byte after the first, bit[7:6]=0b10. The remaining bits carry the code.
  For example, 0b1110xxxx,0b10xxxxxx,0b10xxxxxx is the three-byte format, where x marks a bit available for the code. It has 4+6+6=16 such bits, so the three-byte format has 2^16 encoding positions in total.
  The following table lists the specific formats for 1~4 bytes:

utf-8 n-byte encoding format	Unicode code range
0b0xxxxxxx	0x00~0x7F
0b110xxxxx 0b10xxxxxx	0x80~0x7FF
0b1110xxxx 0b10xxxxxx 0b10xxxxxx	0x800~0xFFFF
0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx	0x10000~0x10FFFF
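
The table can be read in two directions: from a first byte to the length of the sequence, and from a Unicode code to the number of bytes it needs. The C sketch below shows both lookups. The function names are hypothetical, and the first function checks only the prefix bits of the first byte; it does not validate the bytes that follow.

```c
#include <stdint.h>

/* Return the utf-8 sequence length announced by a first byte,
 * or 0 if the byte cannot start a sequence (for example 0b10xxxxxx,
 * which is only valid as a following byte). */
int utf8_len_from_first_byte(uint8_t first)
{
    if ((first & 0x80) == 0x00) return 1; /* 0b0xxxxxxx */
    if ((first & 0xE0) == 0xC0) return 2; /* 0b110xxxxx */
    if ((first & 0xF0) == 0xE0) return 3; /* 0b1110xxxx */
    if ((first & 0xF8) == 0xF0) return 4; /* 0b11110xxx */
    return 0;
}

/* Return the number of utf-8 bytes needed for a Unicode code,
 * following the number ranges in the table, or 0 if it is out of range. */
int utf8_len_for_code_point(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}
```

For instance, the first byte 0xE9 announces a three-byte sequence, and the code 0x90ED needs three bytes.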

The above is the storage format. How, then, is a Unicode character code filled into the utf-8 format?
The steps to convert a Unicode code into a utf-8 encoding are as follows:

First, find the number range in the table above that contains the Unicode character code. For example, the Unicode code of the character A is 0x41, which falls in the number range of the utf-8 1-byte format (0~127).
Then fill the binary bits of the Unicode character code into the x bits of that format from right to left, dropping the leading zeros (any x bits left over are set to 0). For example, the character A (0x41=0b01000001) becomes 0b1000001 once the leading 0 is dropped. Filling it into 0b0xxxxxxx gives 0b01000001, which is the utf-8 encoding of the character A.
Another example: the Unicode code of the Chinese character 郭 (Guo) is 0x90ED (0b10010000,11101101). Since 0x800 < 0x90ED < 0xFFFF, it uses the 3-byte utf-8 format. Filling 0b10010000,11101101 into 0b1110xxxx 0b10xxxxxx 0b10xxxxxx gives 0b11101001,10000011,10101101=0xE983AD, which is the utf-8 encoding of 郭.
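
The filling procedure can be written as a handful of shifts and masks. The following C function is a minimal sketch of the Unicode to utf-8 direction; the name utf8_encode is hypothetical, and the sketch assumes the caller passes a real character code (it does not reject the surrogate range 0xD800~0xDFFF).

```c
#include <stdint.h>

/* Encode one Unicode code as utf-8 into out (room for 4 bytes).
 * Returns the number of bytes written (1..4), or 0 if cp > 0x10FFFF. */
int utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp <= 0x7F) {                    /* 0b0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                   /* 0b110xxxxx 0b10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                  /* 0b1110xxxx + 2 following bytes */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* 0b11110xxx + 3 following bytes */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Passing 0x90ED writes the three bytes 0xE9 0x83 0xAD, and passing 0x41 writes the single byte 0x41, matching the two worked examples.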

Once the relationship between utf-8 and Unicode is understood, it is easy to write a conversion program. The following converts a utf-8 encoding to a Unicode encoding in C:

#include <stdint.h>

/*
 * Convert one utf-8 character to its Unicode code (2-byte UCS-2 mode).
 * pIn: address of the first byte of the utf-8 character to convert
 * charsize: receives the number of bytes (1~3) the utf-8 character occupies,
 *           or 0 on failure
 * pOut: receives the UCS-2 code; pOut[0] is the high byte, pOut[1] the low byte
 * Returns 0 on success, -1 on a malformed or unsupported sequence.
 */
int Utf8ToUCS2(const char *pIn, char *charsize, char *pOut)
{
    /* Work on unsigned bytes so that shifts and comparisons are well defined. */
    const uint8_t *in = (const uint8_t *)pIn;
    uint8_t *pUCS2 = (uint8_t *)pOut;

    *charsize = 0;

    if (in[0] < 0x80) { /* 0~127, ASCII, 1 byte */
        pUCS2[0] = 0;
        pUCS2[1] = in[0];
        *charsize = 1;
    }
    else if ((in[0] & 0xE0) == 0xC0) { /* 128~2047, 2 bytes */
        if ((in[1] & 0xC0) != 0x80) {
            return -1;
        }
        pUCS2[0] = (in[0] & 0x1F) >> 2;
        pUCS2[1] = (uint8_t)((in[0] << 6) | (in[1] & 0x3F));
        *charsize = 2;
    }
    else if ((in[0] & 0xF0) == 0xE0) { /* 2048~65535, 3 bytes */
        if ((in[1] & 0xC0) != 0x80 || (in[2] & 0xC0) != 0x80) {
            return -1;
        }
        pUCS2[0] = (uint8_t)((in[0] << 4) | ((in[1] & 0x3F) >> 2));
        pUCS2[1] = (uint8_t)((in[1] << 6) | (in[2] & 0x3F));
        *charsize = 3;
    }
    else { /* 4-byte sequences do not fit in UCS-2 and are not handled;
              the Chinese characters in the BMP use at most 3 bytes */
        return -1;
    }

    return 0;
}

Chinese character encoding conversion tool and encoding table

1. Qianqianxiuzi (https://www.qqxiuzi.cn/daohang.htm): this website provides encoding-related tools such as conversion between encodings, code value lookup, and Chinese character code value tables.
