A pot of encoding knowledge (ASCII, ISO-8859-1 (Latin1), GB2312, GBK, GB18030, BIG5, UTF-8, UTF-16, UTF-32, BOM, garbled text)

Encoding knowledge, stewed together in one pot.

Encoding history

Each entry below gives the era, the encoding and its unit length, a description, and the problem that motivated the next step.

  • Machine-code era: binary (8-bit). By convention, 1 represents a high voltage level and 0 a low level. Problem: humans cannot read it.
  • Age of enlightenment: ASCII (single byte). The American Standard Code for Information Interchange, still the most common baseline today. It is a single-byte scheme in which the highest bit (b7) is 0 and was used for parity checking; the other seven bits encode the upper- and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English. Problem: not comprehensive enough for other languages.
  • Development era: ISO-8859-1 (Latin1, single byte). An early European standard that uses the values 128–255 for the special characters of Latin-alphabet languages. It is backward compatible with ASCII: its range is 0x00–0xFF (0–255), where 0x00–0x7F (0–127) matches ASCII exactly, 0x80–0x9F (128–159) are control characters, and 0xA0–0xFF (160–255) are text symbols. Problem: leaves no room for other writing systems.
  • Prosperous era: GB2312 (double byte). The Chinese-character code specification issued by China in 1980; "GB" is the pinyin abbreviation of "national standard", and 2312 is the standard's serial number. GB2312 contains 6763 Chinese characters in total (3755 first-level and 3008 second-level), plus 682 full-width characters including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters. Problem: misses the rare characters that appear in personal names, classical Chinese, and so on.
  • GBK (double byte). The "Chinese Character Internal Code Extension Specification" of 1995. GBK is backward compatible with GB2312 and contains 21,003 Chinese characters and symbols; the 6763 characters inherited from GB2312 are usually called "common characters", while the characters in GBK but not in GB2312 are called "uncommon characters". Problem: still does not cover the whole Chinese writing system.
  • GB18030 (mixed 2-byte and 4-byte). The standard GB18030-2000, "Chinese Coded Character Set for Information Interchange, Extension for the Basic Set", issued in 2000. GB18030 is a very large coded character set built around Chinese characters that also includes several minority scripts of China (Tibetan, Mongolian, Dai, Yi, Korean, Uyghur, and others), covering more than 70,000 Chinese characters. Problem: because two bytes give too few code positions, GB18030 mixes 2-byte and 4-byte sequences, which complicates software support, so it has never been widely adopted.
  • BIG5 (double byte). The traditional-Chinese encoding scheme used in Taiwan and Hong Kong, planned and formulated by the Institute for Information Industry in 1984. It contains 13,053 Chinese characters, 408 other characters, and 33 control characters, and it was the de facto standard for early Chinese-language computing and the most widely used encoding in the Chinese-speaking community. Problem: its repertoire is limited.
  • Unification era: Unicode (originally double byte). A character encoding scheme developed by an international consortium to accommodate every script and symbol in the world, assigning each character in each language a single, unique code. Problem: in the original two-byte form, the high byte of every ASCII character is 0, which wastes space for the most common text.
  • UTF-8 (Unicode, variable length). The most widely used Unicode encoding; its defining feature is variable length, using 1 to 4 bytes per character depending on the code point, which makes it very space-efficient for English text. Problem: variable length makes indexing a string by subscript painful.
  • UTF-16 (Unicode, mostly fixed length). Each Unicode code point in the Basic Multilingual Plane is represented by one 16-bit unit (2 bytes); code points beyond it are represented by a pair of 16-bit units (4 bytes). Problem: there is no single byte-level representation of UTF-16 text, so its byte order differs across platforms, which complicates code portability.
  • UTF-32 (Unicode, four bytes). The length is always fixed: every Unicode code point is represented by 32 bits (4 bytes). Problem: a waste of space for English text.
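The length column above can be made concrete with a short sketch (the class name CharsetWidths is illustrative): measure how many bytes the same Chinese character 中 (U+4E2D) occupies under several of these encodings. GBK and UTF-32BE are named charsets that ship with standard JDKs.

```java
import java.nio.charset.Charset;

public class CharsetWidths {
    // Number of bytes a string occupies under the named encoding
    public static int width(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String han = "\u4E2D"; // 中
        System.out.println("GBK     : " + width(han, "GBK"));      // 2 bytes
        System.out.println("UTF-8   : " + width(han, "UTF-8"));    // 3 bytes
        System.out.println("UTF-16BE: " + width(han, "UTF-16BE")); // 2 bytes
        System.out.println("UTF-32BE: " + width(han, "UTF-32BE")); // 4 bytes
    }
}
```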

UTF-8 variable-length encoding rules

  • For a single-byte character, the first bit is set to 0 and the remaining 7 bits hold the character's Unicode code point. Characters 0–127 are therefore identical to ASCII, which means documents from the ASCII era open under UTF-8 without any problem.
  • For a character that needs N bytes (N > 1), the first N bits of the first byte are set to 1 and the (N+1)-th bit is set to 0; each of the remaining N − 1 bytes starts with the two bits 10. All remaining bits are filled with the character's Unicode code point.
Unicode range (hex)   UTF-8 representation (binary)
00000000 – 0000007F   0xxxxxxx
00000080 – 000007FF   110xxxxx 10xxxxxx
00000800 – 0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
00010000 – 0010FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
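The table above translates almost line by line into code. Below is a minimal hand-rolled encoder for a single code point (the class name Utf8Encode is illustrative; real code should simply call String.getBytes(StandardCharsets.UTF_8)):

```java
public class Utf8Encode {
    // Encode one Unicode code point to UTF-8 following the table above
    public static byte[] encode(int cp) {
        if (cp <= 0x7F) {
            return new byte[]{(byte) cp};                        // 0xxxxxxx
        } else if (cp <= 0x7FF) {
            return new byte[]{(byte) (0xC0 | (cp >> 6)),         // 110xxxxx
                              (byte) (0x80 | (cp & 0x3F))};      // 10xxxxxx
        } else if (cp <= 0xFFFF) {
            return new byte[]{(byte) (0xE0 | (cp >> 12)),        // 1110xxxx
                              (byte) (0x80 | ((cp >> 6) & 0x3F)),
                              (byte) (0x80 | (cp & 0x3F))};
        } else {                                                 // up to 0x10FFFF
            return new byte[]{(byte) (0xF0 | (cp >> 18)),        // 11110xxx
                              (byte) (0x80 | ((cp >> 12) & 0x3F)),
                              (byte) (0x80 | ((cp >> 6) & 0x3F)),
                              (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        // U+4E2D (中) falls in the 0800–FFFF row: 1110xxxx 10xxxxxx 10xxxxxx
        for (byte b : encode(0x4E2D)) System.out.printf("%02X ", b); // E4 B8 AD
        System.out.println();
    }
}
```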

BOM, little-endian, and big-endian

Since Unicode characters can occupy more than one byte, physical storage raises an ordering question: is the high-order byte stored at the lower or the higher memory address, and likewise for the low-order byte? The two answers give rise to the big-endian and little-endian byte orders:

  • big-endian: the high-order byte is stored at the lower memory address.
  • little-endian: the low-order byte is stored at the lower memory address.
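For example, under the two byte orders the code units of 中 (U+4E2D) come out as mirror images of each other; a minimal sketch using the JDK's fixed-order charsets:

```java
import java.nio.charset.StandardCharsets;

public class EndianDemo {
    public static void main(String[] args) {
        byte[] be = "\u4E2D".getBytes(StandardCharsets.UTF_16BE); // 中
        byte[] le = "\u4E2D".getBytes(StandardCharsets.UTF_16LE);
        // The high-order byte 4E comes first in big-endian, last in little-endian
        System.out.printf("big-endian   : %02X %02X%n", be[0], be[1]); // 4E 2D
        System.out.printf("little-endian: %02X %02X%n", le[0], le[1]); // 2D 4E
    }
}
```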

How does a computer know which byte order a certain file uses?

The Unicode specification defines a character that can be placed at the very front of a file to indicate its byte order: the "zero width no-break space", represented by FEFF. This is exactly two bytes, and FF is one greater than FE. If the first two bytes of a text file are FE FF, the file uses big-endian byte order; if they are FF FE, it uses little-endian byte order. This marker is the BOM (byte order mark).
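A minimal detection sketch along these lines (the class name BomSniff is illustrative; the UTF-8 BOM EF BB BF is included for completeness, even though byte order is not an issue for UTF-8):

```java
import java.nio.charset.StandardCharsets;

public class BomSniff {
    // Guess the byte order of a buffer from its leading bytes
    public static String sniff(byte[] b) {
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return "UTF-16 big-endian";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return "UTF-16 little-endian";
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF)
            return "UTF-8 with BOM";
        return "no BOM";
    }

    public static void main(String[] args) {
        // The JDK's "UTF-16" charset writes a big-endian BOM when encoding
        byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16);
        System.out.println(sniff(utf16)); // UTF-16 big-endian
    }
}
```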

Causes and solutions to garbled characters

Garbled text appears when a decoder meets bytes beyond its parsing range, that is, when text is decoded with a different encoding than the one it was written in. One defensive workaround is to convert characters to Unicode escape sequences (\uxxxx), which are pure ASCII and survive any of the encodings above:

/**
 * Get the Unicode escape sequence of every character in a string.
 * @param source the input string
 * @return the string rewritten as \uxxxx escapes
 */
public static String getUnicode(String source) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < source.length(); i++) {
        // charAt() yields one UTF-16 code unit; pad it to four hex digits
        result.append(String.format("\\u%04x", (int) source.charAt(i)));
    }
    return result.toString();
}
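To see the cause in action, here is a small round trip (the class name MojibakeDemo is illustrative): decoding UTF-8 bytes as Latin1 yields garbled text, and because Latin1 maps every byte value to some character, reversing the mistake recovers the original.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decode UTF-8 bytes with the wrong charset: classic garbling
    public static String garble(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverse the mistake: re-encode as Latin1, then decode as UTF-8
    public static String recover(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u7F16\u7801"; // 编码
        String garbled = garble(original);
        System.out.println(garbled);          // prints garbled Latin1 text
        System.out.println(recover(garbled)); // prints 编码 again
    }
}
```

The recovery only works because Latin1 decoding is lossless for every byte value; decoding the same bytes with a charset that replaces invalid sequences (such as GBK) can destroy the original bytes for good.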

A method that extracts complete characters one by one, keeping UTF-16 code units that form a single whole (surrogate pairs, and pairs of code points such as flag emoji) together:

public static void getComblineUnicode(String str) {
    List<String> nicknameUnicodes = new ArrayList<>();
    System.out.println("-------------------------");
    System.out.println("raw = " + str);

    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);

        if (Character.isHighSurrogate(c) && i + 1 < str.length()
                && Character.isLowSurrogate(str.charAt(i + 1))) {
            // Two code units (a surrogate pair) form one code point.
            // Some symbols, such as flag emoji, are a pair of regional
            // indicator code points: four code units forming one whole.
            if (i + 3 < str.length()
                    && Character.isHighSurrogate(str.charAt(i + 2))
                    && Character.isLowSurrogate(str.charAt(i + 3))
                    && isRegionalIndicator(str.codePointAt(i))
                    && isRegionalIndicator(str.codePointAt(i + 2))) {
                String quad = str.substring(i, i + 4);
                System.out.println("Four code units form a whole: " + quad);
                nicknameUnicodes.add(quad);
                i += 3; // skip the next three code units, already consumed
            } else {
                String pair = str.substring(i, i + 2);
                System.out.println("Two code units form a whole: " + pair);
                nicknameUnicodes.add(pair);
                i++; // skip the next code unit, already consumed
            }
        } else if (Character.isSurrogate(c)) {
            // An unpaired surrogate is not a valid character on its own
            System.out.println("Undetermined composition: " + c);
            nicknameUnicodes.add("*");
        } else {
            // A single code unit is a complete character
            System.out.println("Single code unit: " + c);
            nicknameUnicodes.add(String.valueOf(c));
        }
    }

    System.out.println("The first complete character: " + nicknameUnicodes.get(0));
    System.out.println("The last complete character: " + nicknameUnicodes.get(nicknameUnicodes.size() - 1));
}

// Regional indicator symbols U+1F1E6..U+1F1FF; two of them form a flag emoji
private static boolean isRegionalIndicator(int codePoint) {
    return codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF;
}
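The surrogate bookkeeping above can also be delegated to the JDK: codePointAt joins a surrogate pair into one code point, and charCount reports how many char units it used. Below is a minimal sketch (the class name CodePointSplit is illustrative; it does not group multi-code-point sequences such as flags, which need java.text.BreakIterator):

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointSplit {
    // Split a string into complete Unicode characters (code points)
    public static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);                  // joins surrogate pairs
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);               // 1 or 2 char units
        }
        return out;
    }

    public static void main(String[] args) {
        // "a" + 中 + 😀 (U+1F600, a surrogate pair in UTF-16)
        List<String> parts = split("a\u4E2D\uD83D\uDE00");
        System.out.println(parts.size());          // 3 characters
        System.out.println(parts.get(2).length()); // 2 char units, one character
    }
}
```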