# The knowledge of character sets (I believe you can understand)


Article directory

  • The knowledge of character sets (I believe you can understand)
    • 1 Character set organization and standards
    • 2 Character sets and corresponding encoding methods, etc.
      • 2.1 Character Set Unicode 1.0 (UCS-2)
      • 2.2 Character Set Unicode 2.0 (UCS-4)
    • 3 Unicode (UCS) code point values
    • 4 Description of encoding method
      • 4.1 Encoding with UTF-16 (UCS-2), UTF-32 (UCS-4)
      • 4.2 Encoding with UTF-8
    • 5 The difference between C.UTF-8 and en_US.UTF-8
    • 6 last

1 Character set organization and standard

The ASCII character set (American Standard Code for Information Interchange) uses 7 bits to represent a character, for a total of 128 characters. (IBM later extended it to 8 bits, 256 characters.)

English letters plus special characters number fewer than 256, so one byte is enough for them. One byte is not enough for other scripts, such as the tens of thousands of Chinese characters, so many other character sets appeared. This caused problems when data was exchanged between character sets: a number that represents the character A in one character set may represent something else in another. To solve this, organizations such as Unicode and ISO set out to define a single standard in which each number corresponds to exactly one character. ISO named its standard UCS (Universal Character Set), and the Unicode organization named its standard Unicode.
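The exchange problem described above can be reproduced in a few lines. This is a minimal sketch; GBK and Latin-1 are chosen here only as two convenient legacy character sets and are not singled out by the article.

```python
# One byte sequence, interpreted under two legacy character sets.
# GBK and Latin-1 are illustrative choices only.
raw = "你".encode("gbk")            # the bytes a GBK system would write

as_gbk = raw.decode("gbk")          # read back with the same character set
as_latin1 = raw.decode("latin-1")   # the same bytes read as Western European text

print(raw.hex(), as_gbk, as_latin1)
```

The bytes never change; only the table used to interpret them does, which is why the second reader sees unrelated characters.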

| Organization | Character set standard | Character set | Remarks |
|---|---|---|---|
| United States | ASCII standard | ASCII (7 bits) | English characters; superseded by Unicode/UCS, which remain compatible with it |
| Unicode | Unicode standard | Unicode 1.0 (2 bytes), Unicode 2.0 (4 bytes) | Universal for all languages |
| ISO | UCS (Universal Character Set) standard | UCS-2 (2 bytes), UCS-4 (4 bytes) | Universal for all languages |
| China | Special for Chinese characters | GB2312, GBK, GB18030 (GB18030-2000, GB18030-2005) | Expected to be phased out eventually; not discussed in depth here |
| Others | Other special character sets | | Expected to be phased out eventually; not discussed in depth here |

Unicode and the ISO character set standard assign exactly the same codes to characters, and the name Unicode is the one in more common use.

2 Character sets and corresponding encoding methods

2.1 Character Set Unicode 1.0 (UCS-2)

Unicode 1.0 uses two bytes (16 bits) to represent every character, so it can hold at most 2 to the 16th power (65,536) characters.
Two encoding methods are covered here for this character set: UTF-8 and UTF-16. Note that these are encoding methods for the character set, not character sets themselves.

Use these bytes at the beginning of the file to identify the encoding method of the file:

| Unicode character set (ISO character set) | Unicode encoding method (ISO encoding method) | File start bytes | Bytes per character |
|---|---|---|---|
| Unicode 1.0 (UCS-2) | UTF-8 (none) | EF BB BF | 1-4 |
| Unicode 1.0 (UCS-2) | UTF-16 (UCS-2) | FE FF | 2 |

Note: UCS is the standard formulated by the ISO mentioned above. It is exactly the same as Unicode, but the name is different. UCS-2 corresponds to UTF-16, and UTF-8 has no corresponding UCS

2.2 Character Set Unicode 2.0 (UCS-4)

At first, Unicode used two bytes (16 bits) to encode characters. When that turned out to be insufficient (the world's languages have a great many characters), it was extended in 1996 to four bytes (32 bits), corresponding to UCS-4. Unicode version 2.0 uses four bytes to represent all characters.

Use these bytes at the beginning of the file to identify the encoding method of the file:

| Unicode character set (ISO character set) | Unicode encoding method (ISO encoding method) | File start bytes | Bytes per character | Added in Unicode 2.0 |
|---|---|---|---|---|
| Unicode 2.0 (UCS-4) | UTF-8 (none) | EF BB BF | 1-4 (1-6 in the original design) | No |
| Unicode 2.0 (UCS-4) | UTF-16 (UCS-2, big endian) | FE FF | 2 (4 with a surrogate pair) | No |
| Unicode 2.0 (UCS-4) | UTF-16 (UCS-2, little endian) | FF FE | 2 (4 with a surrogate pair) | Yes |
| Unicode 2.0 (UCS-4) | UTF-32 (UCS-4, little endian) | FF FE 00 00 | 4 | Yes |
| Unicode 2.0 (UCS-4) | UTF-32 (UCS-4, big endian) | 00 00 FE FF | 4 | Yes |
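As a rough illustration of how a program might use these start bytes, the sketch below checks a byte string against the byte order marks that Python's standard `codecs` module exposes. The helper name `sniff_bom` is hypothetical, and real files often carry no mark at all, so a `None` result is common.

```python
import codecs

# Longer marks are tested first: the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading byte order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF16_BE + "汉".encode("utf-16-be")))
print(sniff_bom(b"plain ASCII"))
```

Testing the 4-byte marks first is a design choice, not a guarantee: UTF-16 little-endian text that begins with U+0000 would be indistinguishable from the UTF-32 little-endian mark.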

Note:

  • UCS is the standard formulated by the ISO mentioned above, which is exactly the same as Unicode, but the name is different. UCS-2 corresponds to UTF-16, UCS-4 corresponds to UTF-32, and UTF-8 has no corresponding UCS.

  • The difference between big endian and little endian: Unicode code points can be stored directly in UCS-2 format. Take the Chinese character “严” as an example. Its code point is U+4E25, which needs two bytes: one byte is 4E and the other is 25. Storing 4E first and 25 second is the big-endian method; storing 25 first and 4E second is the little-endian method. Big-endian order is friendlier to human readers because it matches the way the number is written. The Unicode 1.0 table above lists only the FE FF mark, which corresponds to big-endian order.
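The two byte orders can be seen directly by encoding the code point U+4E25 from the note above with Python's built-in `utf-16-be` and `utf-16-le` codecs. This is only a small sketch; the variable names are arbitrary.

```python
ch = "\u4e25"  # the example code point U+4E25

big = ch.encode("utf-16-be")     # most significant byte first
little = ch.encode("utf-16-le")  # least significant byte first

print(big.hex(" "), "|", little.hex(" "))
```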

3 Unicode (UCS) code point values

A code point is a very important concept in the Unicode standard, and a code point corresponds to a character.

  • The Unicode code point of [汉] is [0x6C49].
kevin@TM1701-b38cbc23:~$ echo -e '\u6c49'
汉
  • Note that a single abstract character may correspond to more than one code point. For example, Ω represents the uppercase Greek letter Omega, with code point U+03A9, and it also represents the ohm symbol in physics, with code point U+2126.
kevin@TM1701-b38cbc23:~$ echo -e '\u03a9'
Ω
kevin@TM1701-b38cbc23:~$ echo -e '\u2126'
Ω
  • A single abstract character can also be represented by a sequence of code points. For example, é has the code point U+00E9, but it can also be written as the lowercase letter e (U+0065) followed by the combining acute accent (U+0301).
kevin@TM1701-b38cbc23:~$ echo -e '\u00e9'
é
kevin@TM1701-b38cbc23:~$ echo -e 'e\u0301'
é

In the Unicode standard, code points are usually written in hexadecimal and prefixed with U+.
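The same three observations can be checked in Python with `ord` and the standard `unicodedata` module. This is a sketch for illustration; the use of NFC normalization to compare the variants is an addition here, not something the article itself relies on.

```python
import unicodedata

# A character and its code point.
han_code_point = ord("汉")
print(f"U+{han_code_point:04X}")

# Two code points for what looks like the same character.
omega, ohm = "\u03a9", "\u2126"
print(omega == ohm)                                # different code points
print(unicodedata.normalize("NFC", ohm) == omega)  # canonically equivalent

# One character written as a single code point or as a sequence.
precomposed, combining = "\u00e9", "e\u0301"
print(len(precomposed), len(combining))
print(unicodedata.normalize("NFC", combining) == precomposed)
```

Normalizing both sides before comparing is one common way to treat such variants as equal.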

4 Encoding Description

Note:

  • Unicode 1.0, Unicode 2.0: is a character set

  • UCS-2, UCS-4: both a character set and an encoding method

  • UTF-16, UTF-32: encoding methods that correspond to the UCS-2 and UCS-4 encoding methods. UTF-32 always uses 4 bytes, and UTF-16 uses 2 bytes for every code point up to U+FFFF (code points above that take a 4-byte surrogate pair), so the code point value can be stored directly without further transformation.

  • UTF-8: a variable-length encoding method for storing Unicode code points. The number of bytes needed to store a character depends on the size of its code point.

  • Why both UTF-16/UTF-32 and UTF-8 exist: when text mixes English with Chinese or other characters, UTF-8 can save a lot of storage space. An English letter takes only 1 byte in UTF-8, while it takes 4 bytes in UTF-32.
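To make the space argument concrete, the sketch below encodes one short mixed string three ways and counts the bytes. The sample text is made up for illustration, and the `-be` codec variants are used so that no byte order mark is added to the count.

```python
text = "Hello, 汉字"  # 7 ASCII characters followed by 2 Chinese characters

sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

for enc, size in sizes.items():
    print(f"{enc:10} {size} bytes")
```

The more ASCII the text contains, the larger the gap becomes; text that is mostly Chinese would instead be slightly smaller in UTF-16 than in UTF-8.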

The following sections use [汉] as the example. Its Unicode code point is [0x6C49], which is [01101100 01001001] in binary.

4.1 Use UTF-16 (UCS-2), UTF-32 (UCS-4) encoding

The UTF-16 (UCS-2) and UTF-32 (UCS-4) methods need no further transformation for this character: its Unicode code point value is stored directly.

  • In UTF-16 (UCS-2), [汉] is stored as [01101100 01001001], which takes 16 bits, or two bytes. A program that knows the data is UTF-16 can parse it two bytes at a time, which is simple and clear.

  • In UTF-32 (UCS-4), [汉] is stored as [00000000 00000000 01101100 01001001], which takes 4 bytes; the high-order bits are padded with 0.
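The two stored forms can be reproduced with Python's built-in codecs. The sketch also encodes one character above U+FFFF (an emoji chosen arbitrarily) to show the case where UTF-16 needs a surrogate pair and no longer stores the code point value directly.

```python
han = "汉"
han_utf16 = han.encode("utf-16-be")   # the code point stored in two bytes
han_utf32 = han.encode("utf-32-be")   # the same value padded to four bytes

emoji = "\U0001F600"                  # a code point above U+FFFF
emoji_utf16 = emoji.encode("utf-16-be")  # a 4-byte surrogate pair
emoji_utf32 = emoji.encode("utf-32-be")  # still the plain code point value

for label, data in [("汉 UTF-16", han_utf16), ("汉 UTF-32", han_utf32),
                    ("U+1F600 UTF-16", emoji_utf16), ("U+1F600 UTF-32", emoji_utf32)]:
    print(f"{label:16} {data.hex(' ')}")
```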

4.2 Encoding with UTF-8

UTF-8 is a remarkable encoding method. It is backward compatible with ASCII and reduces storage space, which helped Unicode gain wide acceptance.

UTF-8 is currently the most widely used Unicode encoding method on the Internet, and its defining feature is variable length. Current UTF-8 uses 1 to 4 bytes to represent a character (the original design allowed up to 6), with the length depending on the character. The encoding rules are as follows:

  1. For single-byte characters, the first bit is set to 0, and the next 7 bits correspond to the Unicode code point of the character. Therefore, for characters 0 – 127 in English, it is exactly the same as the ASCII code. This means that documents from the era of ASCII codes can be opened with UTF-8 encoding without any problem.

  2. For characters that need N bytes (N > 1), the first N bits of the first byte are all set to 1 and bit N + 1 is set to 0. In each of the remaining N – 1 bytes, the first two bits are set to 10. All remaining bits are filled with the character’s Unicode code point value.
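Read in the other direction, the two rules mean that a decoder can tell how long a sequence is from its first byte alone. A minimal sketch, assuming current UTF-8 with at most 4 bytes per character; the function name `sequence_length` is hypothetical.

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if (first_byte & 0b10000000) == 0b00000000:
        return 1  # 0xxxxxxx: single byte, same as ASCII
    if (first_byte & 0b11100000) == 0b11000000:
        return 2  # 110xxxxx
    if (first_byte & 0b11110000) == 0b11100000:
        return 3  # 1110xxxx
    if (first_byte & 0b11111000) == 0b11110000:
        return 4  # 11110xxx
    raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")


for ch in "A©汉😀":
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(" "), sequence_length(encoded[0]))
```

A byte starting with 10 can never begin a character, which is what lets a reader resynchronize after landing in the middle of a sequence.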

In table form (each x is a bit taken from the code point):

| Significant code point bits | Unicode hexadecimal code point range | UTF-8 binary | Description |
|---|---|---|---|
| up to 7 | 0000 0000 – 0000 007F | 0xxxxxxx | One byte, for compatibility with ASCII |
| up to 11 | 0000 0080 – 0000 07FF | 110xxxxx 10xxxxxx | Needs two bytes, so the first byte starts with ‘110’ |
| up to 16 | 0000 0800 – 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx | Needs three bytes, so the first byte starts with ‘1110’ |
| up to 21 | 0001 0000 – 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Needs four bytes, so the first byte starts with ‘11110’ |

The original design continued the same pattern to 5-byte and 6-byte forms for larger values, but these are not valid in current UTF-8 because Unicode code points stop at 10FFFF.

The Unicode code point value of [汉] is [01101100 01001001]. Dropping the leading [0] leaves 15 significant bits, which puts the code point in the 0800 – FFFF range, so [汉] needs three bytes. Filling those bits into the pattern 1110xxxx 10xxxxxx 10xxxxxx gives its final UTF-8 encoding: [11100110 10110001 10001001], or E6 B1 89 in hexadecimal.
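The bit-filling steps can be written out by hand and checked against Python's built-in encoder. This is a sketch of the rules for the four ranges of current UTF-8, not a replacement for the standard codec; the function name `utf8_encode` is hypothetical.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling its bits into the UTF-8 patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),   # 11110xxx and three 10xxxxxx
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


encoded = utf8_encode(0x6C49)  # the code point of 汉
print(" ".join(f"{b:08b}" for b in encoded))
print(encoded == "汉".encode("utf-8"))
```

Each continuation byte carries 6 bits of the code point, which is why the shifts go down in steps of 6 and the mask is 0x3F.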

5 The difference between C.UTF-8 and en_US.UTF-8

C.UTF-8: Generally speaking, C is the locale meant for computers rather than people. It represents the default locale defined by the POSIX standard, in which only strict ASCII characters are valid; C.UTF-8 extends it to allow basic use of UTF-8. UTF-8 stands for the character set and encoding. This locale is generally used in database environments.

en_US.UTF-8: en_US stands for American English; UTF-8 stands for character set and encoding

zh_CN.UTF-8: zh_CN stands for Chinese in mainland China; UTF-8 stands for character set and encoding

These locales differ in sort order, case mapping, thousands separator, default currency symbol, and so on.
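One way to observe the sort order difference is to sort the same words under each locale with Python's standard `locale` module. This is a hedged sketch: the helper `sort_in_locale` and the word list are hypothetical, the result depends on which locales are installed on the machine, and a locale that is missing is reported as `None` rather than guessed.

```python
import locale


def sort_in_locale(words, name):
    """Sort words by the collation rules of locale `name`; None if unavailable."""
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return sorted(words, key=locale.strxfrm)
    except locale.Error:
        return None  # this locale is not installed on the system
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore


words = ["banana", "Apple", "apple", "Banana"]
for name in ("C", "C.UTF-8", "en_US.UTF-8"):
    print(name, sort_in_locale(words, name))
```

On many systems the C locales order strings by raw code values while en_US.UTF-8 applies language-aware collation, but the exact output is left unstated here because it varies by platform.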

6 Last

Love you