Knowledge of character sets (I believe you can understand)
Article directory
- 1 Character set organization and standards
- 2 Character sets and corresponding encoding methods
  - 2.1 Character Set Unicode 1.0 (UCS-2)
  - 2.2 Character Set Unicode 2.0 (UCS-4)
- 3 Unicode (UCS) code point values
- 4 Description of encoding method
  - 4.1 Encoding with UTF-16 (UCS-2), UTF-32 (UCS-4)
  - 4.2 Encoding with UTF-8
- 5 The difference between C.UTF-8 and en_US.UTF-8
- 6 Last
1 Character set organization and standard
The ASCII character set (American Standard Code for Information Interchange) uses 7 bits per character and defines 128 characters in total. (IBM later extended it to 8 bits, 256 characters.)
English letters plus punctuation and special characters number fewer than 256, so one byte is enough for them. That does not work for other scripts, such as the tens of thousands of Chinese characters, so many other character sets appeared. Exchanging data between character sets then became a problem: a number that represents the character A in one character set may represent something else in another. To solve this, organizations such as the Unicode Consortium and ISO each set out to define a unified standard in which any given number corresponds to exactly one character. ISO named its standard UCS (Universal Character Set); the Unicode Consortium named its standard Unicode.
Organization | Character set standard | Character set | Remarks |
---|---|---|---|
United States | ASCII standard | ASCII (7 bits) | English characters; superseded by Unicode/UCS, which remain compatible with it |
Unicode | Unicode standard | Unicode 1.0 (2 bytes), Unicode 2.0 (up to 4 bytes) | Universal for all languages |
ISO | UCS (Universal Character Set) standard | UCS-2 (2 bytes), UCS-4 (4 bytes) | Universal for all languages |
China | Chinese-specific standards | GB2312, GBK, GB18030 (GB18030-2000, GB18030-2005) | Expected to be phased out eventually; not discussed in depth here |
Others | Other special-purpose character sets | | Expected to be phased out eventually; not discussed in depth here |
The Unicode and ISO standards assign the same number to every character, and Unicode is the more commonly used name.
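To make the exchange problem described in this section concrete, here is a small Python sketch. It assumes Python's built-in `gbk` and `latin-1` codecs as stand-ins for two incompatible legacy character sets; the variable names are illustrative only.

```python
# The same two bytes mean different things in different legacy character sets.
raw = "汉".encode("gbk")            # bytes as a GBK-based system would write them

as_gbk = raw.decode("gbk")          # read back with the matching character set
as_latin1 = raw.decode("latin-1")   # read with a different character set

print(raw.hex(" "))   # the raw bytes
print(as_gbk)         # the intended character
print(as_latin1)      # two unrelated Latin-1 characters

# Unicode removes the ambiguity: one number, one character.
print(f"U+{ord(as_gbk):04X}")
```

The bytes never change; only the table used to interpret them does, which is exactly why a single shared table is useful.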
2 Character sets and corresponding encoding methods
2.1 Character Set Unicode 1.0 (UCS-2)
Unicode 1.0 uses two bytes (16 bits) to represent every character, so it can represent at most 2^16 = 65,536 characters.
Two encoding methods are listed for this character set: UTF-8 and UTF-16. Note that these are encoding methods, not character sets.
A file can begin with the following bytes (a byte order mark) to identify its encoding method:
Unicode character set (ISO character set) | Unicode encoding method (ISO encoding method) | Bytes at start of file | Bytes per character |
---|---|---|---|
Unicode 1.0 (UCS-2) | UTF-8 (none) | EF BB BF | 1-3 |
Unicode 1.0 (UCS-2) | UTF-16 (UCS-2) | FE FF | 2 |
Note: UCS is the ISO standard mentioned above. Its character assignments are the same as Unicode's; only the name differs. UCS-2 corresponds to UTF-16 (the two are identical for characters that fit in 16 bits), and UTF-8 has no UCS counterpart.
2.2 Character Set Unicode 2.0 (UCS-4)
At first, Unicode used two bytes (16 bits) per character. When that proved insufficient (the world's languages have a great many characters), the standard was extended in 1996 with Unicode 2.0, corresponding to UCS-4, so that a character can occupy up to four bytes (32 bits). Today the code space is limited to U+0000 through U+10FFFF.
A file can begin with the following bytes (a byte order mark) to identify its encoding method:
Unicode character set (ISO character set) | Unicode encoding method (ISO encoding method) | Bytes at start of file | Bytes per character | Added in Unicode 2.0 |
---|---|---|---|---|
Unicode 2.0 (UCS-4) | UTF-8 (none) | EF BB BF | 1-4 (up to 6 in the original design) | |
Unicode 2.0 (UCS-4) | UTF-16 (UCS-2, big endian) | FE FF | 2 (4 for characters above U+FFFF) | |
Unicode 2.0 (UCS-4) | UTF-16 (UCS-2, little endian) | FF FE | 2 (4 for characters above U+FFFF) | Yes |
Unicode 2.0 (UCS-4) | UTF-32 (UCS-4, little endian) | FF FE 00 00 | 4 | Yes |
Unicode 2.0 (UCS-4) | UTF-32 (UCS-4, big endian) | 00 00 FE FF | 4 | Yes |
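As a rough sketch of how a program might use these leading bytes, the following Python function checks a byte string against the marks in the table. The function name `sniff_bom` and the list `BOMS` are hypothetical; the `codecs.BOM_*` constants come from the Python standard library.

```python
import codecs

# Longest marks first: the UTF-32 little-endian mark begins with the
# same two bytes (FF FE) as the UTF-16 little-endian mark.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, payload) if data starts with a known mark, else (None, data)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


sample = codecs.BOM_UTF16_LE + "汉".encode("utf-16-le")
encoding, payload = sniff_bom(sample)
print(encoding, payload.decode(encoding))
```

This is only a heuristic: many files carry no mark at all, and UTF-16 little-endian text that begins with a NUL character would look like UTF-32 little endian to this check.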
Note:
UCS is the ISO standard mentioned above. Its character assignments are the same as Unicode's; only the name differs. UCS-2 corresponds to UTF-16, UCS-4 corresponds to UTF-32, and UTF-8 has no UCS counterpart.
Big endian versus little endian: a Unicode code point can be stored directly in UCS-2 format. Take the Chinese character “严”, whose code point is U+4E25. It needs two bytes, 4E and 25. Storing 4E first and 25 second is the big-endian method; storing 25 first and 4E second is the little-endian method. The big-endian order matches the way people read the number. The table in section 2.1 lists only FE FF, the big-endian mark, for Unicode 1.0.
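The two byte orders can be checked quickly in Python. This is a minimal sketch using the code point 4E25 from the paragraph above; the variable names are illustrative.

```python
ch = "\u4e25"  # the character with code point 4E25 used in the text

big = ch.encode("utf-16-be")     # most significant byte first
little = ch.encode("utf-16-le")  # least significant byte first

print(big.hex(" "))     # 4e 25
print(little.hex(" "))  # 25 4e

# int.to_bytes shows the same idea without any text codec involved.
print(ord(ch).to_bytes(2, "big").hex(" "))
print(ord(ch).to_bytes(2, "little").hex(" "))
```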
3 Unicode (UCS) code point values
A code point is a very important concept in the Unicode standard, and a code point corresponds to a character.
- The Unicode code point of [汉] is [0x6C49]
kevin@TM1701-b38cbc23:~$ echo -e '\u6c49'
汉
- Note that a single abstract character may correspond to more than one code point. For example, Ω represents the uppercase Greek letter Omega (code point U+03A9) and also the ohm sign used in physics (code point U+2126).
kevin@TM1701-b38cbc23:~$ echo -e '\u03a9'
Ω
kevin@TM1701-b38cbc23:~$ echo -e '\u2126'
Ω
- A single abstract character can also be represented by a sequence of code points. For example, é has the code point U+00E9, but it can also be written as the lowercase letter e (code point U+0065) followed by a Combining Acute Accent (code point U+0301).
kevin@TM1701-b38cbc23:~$ echo -e '\u00e9'
é
kevin@TM1701-b38cbc23:~$ echo -e '\u0065\u0301'
é
In the Unicode standard, code points are usually written in hexadecimal and prefixed with U+.
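The three observations in this section can be reproduced with a short Python sketch. The use of `unicodedata.normalize` goes slightly beyond the text: normalization is the standard library's way of mapping equivalent code point sequences onto one another, and it is included here only to show that the pairs really are treated as the same abstract character.

```python
import unicodedata

print(f"U+{ord('汉'):04X}")  # code point of a character
print(chr(0x6C49))            # character for a code point

omega = "\u03a9"     # GREEK CAPITAL LETTER OMEGA
ohm = "\u2126"       # OHM SIGN
print(omega == ohm)  # different code points, so the strings differ
print(unicodedata.normalize("NFC", ohm) == omega)

composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
print(len(composed), len(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```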
4 Encoding Description
Note:
Unicode 1.0, Unicode 2.0: these are character sets.
UCS-2, UCS-4: each is both a character set and an encoding method.
UTF-16, UTF-32: these are encoding methods, corresponding to the UCS-2 and UCS-4 encoding methods. UTF-32 always uses 4 bytes, and UTF-16 uses 2 bytes for every character up to U+FFFF, so in those cases the code point value is stored directly with no re-encoding. (Characters above U+FFFF take two 16-bit units in UTF-16.)
UTF-8: this is an encoding method for storing Unicode code points. It is variable length: the number of bytes needed for a character depends on the size of its code point.
Why UTF-8 exists alongside UTF-16 and UTF-32: when text mixes English with Chinese or other characters, UTF-8 can save a lot of storage space. An English letter takes only 1 byte in UTF-8, while it takes 4 bytes in UTF-32.
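The storage trade-off can be measured directly. The following Python sketch counts bytes for a few sample strings; the helper name `encoded_sizes` and the samples are illustrative, and the `-be` codec variants are chosen so that no byte order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Bytes needed for text in each encoding (no byte order mark)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


for sample in ("A", "汉", "hello 汉字"):
    print(repr(sample), encoded_sizes(sample))
```

For text that is mostly ASCII, UTF-8 is the smallest of the three; for text that is mostly Chinese, UTF-16 is smaller than UTF-8, which is part of why more than one encoding method remains in use.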
The following sections use [汉] as the example. Its Unicode code point is [0x6C49], which in binary is [01101100 01001001].
4.1 Use UTF-16 (UCS-2), UTF-32 (UCS-4) encoding
UTF-16 (UCS-2) and UTF-32 (UCS-4) need no re-encoding for this character: the Unicode code point value is stored directly.
- With UTF-16 (UCS-2), [汉] is stored as [01101100 01001001], which takes 16 bits, or two bytes. A program that knows the data is UTF-16 simply parses it two bytes at a time, which is simple and clear.
- With UTF-32 (UCS-4), [汉] is stored as [00000000 00000000 01101100 01001001], which takes 4 bytes; the leading bytes are filled with 0.
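The bit patterns above can be printed with a few lines of Python. This is a sketch with a hypothetical helper `bits`; the last example, a code point above U+FFFF, goes beyond the article's own example and is included only to show where direct two-byte storage stops working.

```python
def bits(data: bytes) -> str:
    """Format bytes as space-separated groups of 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


han = "汉"
print(f"{ord(han):016b}")              # the code point itself, as 16 bits
print(bits(han.encode("utf-16-be")))   # same bits, two bytes
print(bits(han.encode("utf-32-be")))   # same bits, zero-padded to four bytes

# Assumption beyond the article's example: a code point above U+FFFF
# does not fit in two bytes, so UTF-16 splits it into a surrogate pair.
emoji = "\U0001F600"
print(emoji.encode("utf-16-be").hex(" "))
```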
4.2 Encoding with UTF-8
UTF-8 is a remarkable encoding method: it is backward compatible with ASCII and reduces storage space, which helped Unicode gain wide acceptance.
UTF-8 is currently the most widely used Unicode encoding on the Internet, and its defining feature is variable length. It uses 1 to 4 bytes per character, depending on the code point (the original design allowed sequences of up to 6 bytes, but those are no longer valid). The encoding rules are as follows:
- For single-byte characters, the first bit is 0 and the remaining 7 bits hold the character's Unicode code point. For characters 0 – 127 the result is therefore identical to ASCII, which means that documents from the ASCII era can be opened as UTF-8 without any problem.
- For characters that need N bytes (N > 1), the first N bits of the first byte are set to 1 and bit N + 1 is set to 0. The first two bits of each of the remaining N – 1 bytes are set to 10. All remaining bits are filled with the character's Unicode code point value.
The encoding rules are summarized below (each x is one bit of the code point):
Code point bits | Unicode code point range (hexadecimal) | UTF-8 binary | Description |
---|---|---|---|
up to 7 bits | 0000 0000 – 0000 007F | 0xxxxxxx | Compatible with ASCII |
up to 11 bits | 0000 0080 – 0000 07FF | 110xxxxx 10xxxxxx | Needs two bytes, so the first byte starts with ‘110’ |
up to 16 bits | 0000 0800 – 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx | … |
up to 21 bits | 0001 0000 – 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Needs four bytes, so the first byte starts with ‘11110’ |
more than 21 bits | … | | Longer sequences followed the same pattern in the original design but are not valid in current UTF-8 |
The Unicode code point of [汉] is [01101100 01001001]. Dropping the leading [0] leaves 15 significant bits, which falls in the 0000 0800 – 0000 FFFF range of the table above, so [汉] needs three bytes. Padding the code point to 16 bits and filling in the x positions gives the final UTF-8 encoding [11100110 10110001 10001001], or E6 B1 89 in hexadecimal.
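The rules in this section can be written out as a short Python function and compared with the built-in encoder. This is an illustrative sketch, not how any particular library implements UTF-8; the name `utf8_encode` is hypothetical, and for brevity it does not reject the surrogate range U+D800 to U+DFFF the way a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point with the 1- to 4-byte rules from the table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        return bytes([code_point])              # 0xxxxxxx
    if code_point <= 0x7FF:
        n, lead = 2, 0b11000000                 # 110xxxxx
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b11100000                 # 1110xxxx
    else:
        n, lead = 4, 0b11110000                 # 11110xxx
    out = []
    # Peel off 6 bits at a time, last byte first; each gets the 10 prefix.
    for _ in range(n - 1):
        out.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # Whatever bits remain go into the first byte after its length prefix.
    out.append(lead | code_point)
    return bytes(reversed(out))


encoded = utf8_encode(0x6C49)
print(" ".join(f"{b:08b}" for b in encoded))
print(encoded == "汉".encode("utf-8"))
```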
5 The difference between C.UTF-8 and en_US.UTF-8
C.UTF-8: C stands for the default locale defined by the POSIX standard, which is aimed at programs rather than at human readers. In the plain C locale only strict ASCII characters are valid; C.UTF-8 extends it to allow basic use of UTF-8. UTF-8 names the character set and encoding. This locale is commonly used in database environments.
en_US.UTF-8: en_US stands for American English; UTF-8 names the character set and encoding.
zh_CN.UTF-8: zh_CN stands for Chinese as used in mainland China; UTF-8 names the character set and encoding.
These locales differ in sort order, case mapping, thousands separator, default currency symbol, and so on.
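The sort-order difference is the easiest one to observe. The following Python sketch uses the standard `locale` module; it assumes the `C` locale is available (it is required by the C standard) and treats `en_US.UTF-8` as optional, since that locale may not be installed on a given system. The word list is made up for illustration.

```python
import locale

words = ["banana", "Apple", "cherry", "apple", "Banana"]

# In the C locale, strings collate by raw character code, so every
# uppercase ASCII letter sorts before every lowercase one.
locale.setlocale(locale.LC_COLLATE, "C")
c_order = sorted(words, key=locale.strxfrm)
print("C:", c_order)

try:
    # Assumption: this locale exists only if the system has generated it.
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    en_order = sorted(words, key=locale.strxfrm)
    print("en_US.UTF-8:", en_order)
except locale.Error:
    en_order = None
    print("en_US.UTF-8 is not installed on this system")
finally:
    locale.setlocale(locale.LC_COLLATE, "C")
```

Where `en_US.UTF-8` is installed, its ordering typically groups words that differ only by case next to each other, but the exact result depends on the platform's locale data.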
6 Last
Love you