# The knowledge of character sets (I believe you can understand)

Article directory

- 1 Character set organization and standards
- 2 Character sets and corresponding encoding methods, etc.
  - 2.1 Character Set Unicode 1.0 (UCS-2)
  - 2.2 Character Set Unicode 2.0 (UCS-4)
- 3 Unicode (UCS) code point values
- 4 Description of encoding method
  - 4.1 Encoding with UTF-16 (UCS-2), UTF-32 (UCS-4)
  - 4.2 Encoding with UTF-8
- 5 The difference between C.UTF-8 and en_US.UTF-8
- 6 Last

1 Character set organization and standard

The ASCII character set (American Standard Code for Information Interchange) uses 7 bits to represent a character, for a total of 128 characters. (Later 8-bit extensions, such as IBM's, raised this to 256 characters.)

English letters plus special characters number fewer than 256, so one byte is enough for them. That does not work for other scripts, such as the tens of thousands of Chinese characters, so many other character sets appeared. Exchanging data between different character sets then became a problem: a certain number may represent the character A in one character set but something else in another, which makes interoperation troublesome. Organizations such as Unicode and ISO therefore set out to define a unified standard in which any given number corresponds to exactly one character. ISO named its standard UCS (Universal Character Set), and the Unicode organization named its standard Unicode.
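The conflict between character sets can be seen with a short Python sketch. The two byte values below are chosen only for illustration; the codec names `gbk` and `latin-1` are the ones Python ships with.

```python
# The same two bytes, read under two different character sets.
raw = bytes([0xD6, 0xD0])

as_gbk = raw.decode("gbk")          # a Chinese character set
as_latin1 = raw.decode("latin-1")   # a Western European character set

# One byte sequence, two unrelated results.
print(repr(as_gbk), repr(as_latin1))
```

Under GBK the two bytes form one Chinese character, while under Latin-1 they are two accented Latin letters. This is exactly the ambiguity that a single universal character set removes.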

| Organization name | Character set standard | Character set | Remarks |
| --- | --- | --- | --- |
| United States | ASCII standard | ASCII (7 bits) | English characters; Unicode/UCS is compatible with it |
| Unicode | Unicode standard | Unicode 1.0 (2 bytes), Unicode 2.0 (4 bytes) | Universal for all languages |
| ISO | UCS (Universal Character Set) standard | UCS-2 (2 bytes), UCS-4 (4 bytes) | Universal for all languages |
| China | Specific to Chinese characters | GB2312, GBK, GB18030 (GB18030-2000, GB18030-2005) | Expected to be phased out eventually; not discussed in depth here |
| Others | | Other special-purpose character sets | Expected to be phased out eventually; not discussed in depth here |

The code assignments of Unicode and of the ISO character set standard are exactly the same; the name Unicode is the one in more common use.

2 Character sets and corresponding encoding methods, etc.

2.1 Character Set Unicode 1.0 (UCS-2)

Two bytes (16 bits) are used to represent every character, which allows at most 2 to the 16th power characters (65536 characters).
Two encoding methods are described here for Unicode 1.0: UTF-8 and UTF-16. Note that these are encoding methods, not character sets.

Use these bytes at the beginning of the file to identify the encoding method of the file:

| Unicode character set (ISO character set) | Unicode encoding method (ISO encoding method) | Bytes at start of file | Bytes per character |
| --- | --- | --- | --- |
| Unicode 1.0 (UCS-2) | UTF-8 (none) | EF BB BF | 1-4 |
| Unicode 1.0 (UCS-2) | UTF-16 (UCS-2) | FE FF | 2 |

Note: UCS is the standard formulated by ISO, mentioned above. Its contents are exactly the same as Unicode; only the name differs. UCS-2 corresponds to UTF-16 for code points up to U+FFFF (UTF-16 additionally reaches higher code points with surrogate pairs), and UTF-8 has no corresponding UCS name.

2.2 Character Set Unicode 2.0 (UCS-4)

At first, Unicode used two bytes (16 bits) to encode characters. When that proved insufficient (the world's languages have a great many characters), the code space was extended in 1996 so that a code point can be stored in four bytes (32 bits), corresponding to UCS-4. Unicode version 2.0 can represent every character in four bytes.

Use these bytes at the beginning of the file to identify the encoding method of the file:

| Unicode character set (ISO character set) | Unicode encoding method (ISO encoding method) | Bytes at start of file | Bytes per character | Added in Unicode 2.0 |
| --- | --- | --- | --- | --- |
| Unicode 2.0 (UCS-4) | UTF-8 (none) | EF BB BF | 1-4 (the original design allowed up to 6) | |
| Unicode 2.0 (UCS-4) | UTF-16 (UCS-2, big endian) | FE FF | 2 (4 for code points above U+FFFF) | |
| Unicode 2.0 (UCS-4) | UTF-16 (UCS-2, little endian) | FF FE | 2 (4 for code points above U+FFFF) | Yes |
| Unicode 2.0 (UCS-4) | UTF-32 (UCS-4, little endian) | FF FE 00 00 | 4 | Yes |
| Unicode 2.0 (UCS-4) | UTF-32 (UCS-4, big endian) | 00 00 FE FF | 4 | Yes |

Note:

UCS is the standard formulated by ISO, mentioned above. Its contents are exactly the same as Unicode; only the name differs. UCS-2 corresponds to UTF-16 for code points up to U+FFFF (UTF-16 additionally reaches higher code points with surrogate pairs), UCS-4 corresponds to UTF-32, and UTF-8 has no corresponding UCS name.
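As a rough illustration of how a program might use these leading bytes, here is a minimal Python sketch. The function name `sniff_bom` is hypothetical; the `codecs.BOM_*` constants are part of Python's standard library. The longer marks are checked first because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian mark.

```python
import codecs

# Longest marks first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),           # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
]

def sniff_bom(data: bytes):
    """Return (encoding, mark_length), or (None, 0) when no mark is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

encoding, skip = sniff_bom(b"\xfe\xff\x6c\x49")
print(encoding, skip)
```

A file without any mark returns `(None, 0)`, in which case the encoding has to be known some other way.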

The difference between big endian and little endian: a Unicode code can be stored directly in UCS-2 format. Take the Chinese character "严" as an example. Its Unicode code is 4E25, which needs two bytes of storage: one byte is 4E and the other is 25. Storing 4E first and 25 second is the big-endian method; storing 25 first and 4E second is the little-endian method. The big-endian order is the friendlier one for human reading, and it is the only order listed for Unicode 1.0 in section 2.1 (FE FF).
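The two byte orders can be checked with a few lines of Python. This is only a sketch using the code 4E25 from the paragraph above; `int.to_bytes` and the `utf-16-be` / `utf-16-le` codecs are standard Python.

```python
code_point = 0x4E25

# Lay the 16-bit value out in both byte orders.
big = code_point.to_bytes(2, "big")        # high byte first
little = code_point.to_bytes(2, "little")  # low byte first
print(big.hex(), little.hex())

# Python's explicit-endian UTF-16 codecs produce the same bytes.
assert chr(code_point).encode("utf-16-be") == big
assert chr(code_point).encode("utf-16-le") == little
```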

3 Unicode (UCS) code point values

A code point is a very important concept in the Unicode standard, and a code point corresponds to a character.

The Unicode code point value of [汉] is [0x6C49]:

kevin@TM1701-b38cbc23:~$ echo -e '\u6c49'
汉
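The same mapping between a character and its code point can be shown in Python, where `ord` and `chr` are the built-in conversions in each direction. A minimal sketch using the code point 0x6C49 given above:

```python
ch = chr(0x6C49)        # code point -> character
value = ord(ch)         # character -> code point

print(ch, hex(value))
```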

Note that a single abstract character may correspond to more than one code point. For example, Ω represents the uppercase Greek letter Omega, with code point U+03A9, and it also represents the ohm symbol in physics, with code point U+2126.

kevin@TM1701-b38cbc23:~$ echo -e '\u03a9'
Ω
kevin@TM1701-b38cbc23:~$ echo -e '\u2126'
Ω
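To see that these really are two distinct code points that look alike, one option is Python's standard `unicodedata` module. The sketch below prints the official character names and shows that NFC normalization maps the ohm sign onto the Greek letter.

```python
import unicodedata

omega = "\u03a9"
ohm = "\u2126"

# Two different code points with two different official names.
names = [unicodedata.name(c) for c in (omega, ohm)]
print(names)

# They compare unequal as strings, but normalization unifies them.
different = omega != ohm
same_after_nfc = unicodedata.normalize("NFC", ohm) == omega
print(different, same_after_nfc)
```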

A single abstract character can also be represented by a sequence of code points. For example, é has the code point U+00E9, but it can also be written as the lowercase letter e (code point U+0065) followed by the combining acute accent (code point U+0301).

kevin@TM1701-b38cbc23:~$ echo -e '\u00e9'
é
kevin@TM1701-b38cbc23:~$ echo -e '\u0065\u0301'
é
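Because the two spellings render the same but differ in code points, comparing them directly can be surprising. A small Python sketch, assuming the standard `unicodedata` module, shows the difference and how normalization converts between the two forms:

```python
import unicodedata

precomposed = "\u00e9"       # one code point
decomposed = "e\u0301"       # letter e + combining acute accent

lengths = (len(precomposed), len(decomposed))
equal_raw = precomposed == decomposed

# NFC composes, NFD decomposes.
composed_again = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)
print(lengths, equal_raw, composed_again == precomposed, split_again == decomposed)
```

Normalizing both sides to the same form before comparing is the usual way to treat them as equal.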

In the Unicode standard, code points are usually written in hexadecimal notation and prefixed with U+.
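As a small illustration, this notation can be produced in Python with a format specifier. The helper name `u_plus` is hypothetical; it pads to at least four hexadecimal digits, which is the common convention.

```python
def u_plus(ch: str) -> str:
    """Format a single character's code point in U+ notation."""
    return f"U+{ord(ch):04X}"

print(u_plus("A"), u_plus("\u6c49"), u_plus("\U0001F600"))
```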

4 Encoding Description

Note:

Unicode 1.0, Unicode 2.0: these are character sets.

UCS-2, UCS-4: each is both a character set and an encoding method.

UTF-16, UTF-32: these are encoding methods that correspond to the UCS-2 and UCS-4 encoding methods. UTF-32 has a fixed length of four bytes, and UTF-16 uses two bytes for every code point up to U+FFFF (code points above that take a pair of two-byte surrogates), so no re-encoding is needed when storing: the Unicode code point value is stored directly.

UTF-8: this is an encoding method for storing Unicode code points. Its length is not fixed: the number of bytes needed to store a character depends on the size of the character's Unicode code point value.

Why UTF-8 exists alongside UTF-16 and UTF-32: when Chinese, English, and other characters are mixed, UTF-8 can save a lot of storage space. An English character stored as UTF-8 occupies only 1 byte, while UTF-32 needs 4 bytes.
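The space difference is easy to measure. The following Python sketch uses a hypothetical helper `byte_sizes` and the explicit big-endian codec names so that no byte order mark is added to the counts.

```python
def byte_sizes(text: str) -> dict:
    """Number of bytes needed to store text in each encoding (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-be", "utf-32-be")}

print(byte_sizes("A"))           # one English letter
print(byte_sizes("\u6c49"))      # one Chinese character
print(byte_sizes("Hello, \u6c49\u5b57"))  # mixed text
```

For mostly English text UTF-8 is the smallest; for text that is purely Chinese, UTF-16 is smaller than UTF-8, so the saving depends on the mix.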

The following uses [汉] as an example. The Unicode code point value of [汉] is [0x6C49], which in binary is [01101100 01001001].

4.1 Use UTF-16 (UCS-2), UTF-32 (UCS-4) encoding

With UTF-16 (UCS-2) and UTF-32 (UCS-4), no re-encoding is needed: the Unicode code point value of the character is used directly (for UTF-16 this holds for code points up to U+FFFF).

With the UTF-16 (UCS-2) encoding method, [汉] is stored directly as [01101100 01001001], which occupies 16 bits, or two bytes. A program that knows the data is UTF-16 simply parses it two bytes at a time, which is simple and clear.
With the UTF-32 (UCS-4) encoding method, [汉] is stored directly as [00000000 00000000 01101100 01001001], which occupies 4 bytes; the leading bits are filled with 0.
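These byte layouts can be confirmed with Python's built-in codecs. The sketch below uses the big-endian variants so the bytes appear in the same order as the binary strings above. It also includes one character above U+FFFF (U+1F600, chosen only for illustration) to show the case where UTF-16 needs four bytes.

```python
han = "\u6c49"

utf16 = han.encode("utf-16-be")
utf32 = han.encode("utf-32-be")
print(utf16.hex(), utf32.hex())

# The stored bytes are just the code point value, zero-padded.
assert int.from_bytes(utf16, "big") == ord(han)
assert int.from_bytes(utf32, "big") == ord(han)

# A code point above U+FFFF is stored as a surrogate pair in UTF-16.
emoji16 = "\U0001F600".encode("utf-16-be")
print(emoji16.hex())
```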

4.2 Encoding with UTF-8

UTF-8 is a remarkable encoding method. It elegantly achieves backward compatibility with ASCII and reduces the space occupied, which helped Unicode gain wide acceptance.

UTF-8 is currently the most widely used Unicode encoding method on the Internet, and its biggest feature is variable length. It uses 1 to 4 bytes to represent a character, and the length changes with the character. The encoding rules are as follows:

For single-byte characters, the first bit is set to 0 and the next 7 bits hold the Unicode code point of the character. For the characters 0 – 127, which cover English, the result is therefore exactly the same as ASCII. This means that documents from the ASCII era can be opened as UTF-8 without any problem.
For characters that need N bytes (N > 1), the first N bits of the first byte are all set to 1 and bit N + 1 is set to 0; the first two bits of each of the remaining N – 1 bytes are set to 10. All remaining bits are filled with the character's Unicode code point value.

The encoding rules in table form (each x is a bit taken from the code point):

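| Code point bits (at most) | Unicode hexadecimal code point range | UTF-8 binary | Description |
| --- | --- | --- | --- |
| 7 | 0000 0000 – 0000 007F | 0xxxxxxx | One byte; compatible with ASCII |
| 11 | 0000 0080 – 0000 07FF | 110xxxxx 10xxxxxx | Needs two bytes, so the first byte starts with '110' |
| 16 | 0000 0800 – 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx | Needs three bytes, so the first byte starts with '1110' |
| 21 | 0001 0000 – 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Needs four bytes, so the first byte starts with '11110' |

The same pattern originally continued to five- and six-byte sequences, but because Unicode code points end at U+10FFFF, UTF-8 as standardized today (RFC 3629) stops at four bytes.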
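The rules above can be written out by hand as a short Python function. This is only a sketch for illustration (the name `utf8_encode` is hypothetical); in practice `str.encode("utf-8")` does the same job, and the last lines compare the two.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by applying the bit patterns directly."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Cross-check against Python's built-in encoder, one sample per row.
for sample in ("A", "\u00e9", "\u6c49", "\U0001F600"):
    assert utf8_encode(ord(sample)) == sample.encode("utf-8")
```

Surrogate code points (U+D800 to U+DFFF) are rejected because they are reserved for UTF-16 and are not valid in UTF-8.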

The Unicode code point value of [汉] is [01101100 01001001]. Dropping the leading [0], the value occupies 15 bits, and it falls in the range 0800 – FFFF, so by the rules above [汉] needs three bytes. After encoding, its final UTF-8 form is [11100110 10110001 10001001].
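The worked example can be reproduced in Python. The sketch below prints the three encoded bytes in binary, then strips the prefix bits (1110 from the first byte, 10 from the other two) to recover the original code point.

```python
encoded = "\u6c49".encode("utf-8")

# The three bytes, shown bit by bit.
bits = " ".join(f"{byte:08b}" for byte in encoded)
print(bits)

# Keep only the payload bits: 4 from the first byte, 6 from each of the others.
payload = (f"{encoded[0] & 0x0F:04b}"
           f"{encoded[1] & 0x3F:06b}"
           f"{encoded[2] & 0x3F:06b}")
recovered = int(payload, 2)
print(payload, hex(recovered))
```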

5 The difference between C.UTF-8 and en_US.UTF-8

C.UTF-8: C is the locale meant for computers rather than people; it is the default locale defined by the POSIX standard. In the plain C locale only strict ASCII characters are valid, and C.UTF-8 extends it to allow basic use of UTF-8. The UTF-8 part stands for the character set and encoding. This locale is commonly used in database environments.

en_US.UTF-8: en_US stands for American English; UTF-8 stands for character set and encoding

zh_CN.UTF-8: zh_CN stands for Chinese in mainland China; UTF-8 stands for character set and encoding

These locales differ in sort order, case relations, thousands separator, default currency symbol, etc.
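A rough way to observe the sort-order difference is Python's standard `locale` module. This is a sketch under assumptions: the `C` locale always exists, but `en_US.UTF-8` may not be installed on a given system, so that part is wrapped in a `try` and no particular result is claimed for it.

```python
import locale

words = ["b", "A", "a", "B"]

# In the C locale, strings are ordered by their raw code values,
# so all uppercase ASCII letters sort before all lowercase ones.
locale.setlocale(locale.LC_COLLATE, "C")
c_order = sorted(words, key=locale.strxfrm)

try:
    # Language-aware collation; the exact order depends on the system's locale data.
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    en_order = sorted(words, key=locale.strxfrm)
except locale.Error:
    en_order = None  # this locale is not installed here
finally:
    locale.setlocale(locale.LC_COLLATE, "C")

print(c_order, en_order)
```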

6 Last

Love you