A pot of encoding knowledge (ASCII, ISO-8859-1 (Latin1), GB2312, GBK, GB18030, BIG5, UTF-8, UTF-16, UTF-32, BOM, garbled text)

Encoding knowledge, stewed together in one pot.

Encoding history

Each entry below gives the era, the encoding and its unit length, a description, and the problem that motivated the next step.

  • Machine-code era: binary (8-bit). By convention, 1 represents a high voltage level and 0 a low level. Problem: humans cannot read it.
  • Age of enlightenment: ASCII (single byte). The American Standard Code for Information Interchange, still the most common baseline today. It is a single-byte scheme in which the highest bit (b7) is 0 and was used for parity checking; the other seven bits encode the upper- and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English. Problem: not comprehensive enough for other languages.
  • Development era: ISO-8859-1 (Latin1, single byte). An early European standard that uses the values 128–255 for the special characters of Latin-alphabet languages. It is backward compatible with ASCII: its range is 0x00–0xFF (0–255), where 0x00–0x7F (0–127) matches ASCII exactly, 0x80–0x9F (128–159) are control characters, and 0xA0–0xFF (160–255) are text symbols. Problem: leaves no room for other writing systems.
  • Prosperous era: GB2312 (double byte). The Chinese-character code specification issued by China in 1980; "GB" is the pinyin abbreviation of "national standard", and 2312 is the standard's serial number. GB2312 contains 6763 Chinese characters in total (3755 first-level and 3008 second-level), plus 682 full-width characters including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters. Problem: misses the rare characters that appear in personal names, classical Chinese, and so on.
  • GBK (double byte). The "Chinese Character Internal Code Extension Specification" of 1995. GBK is backward compatible with GB2312 and contains 21,003 Chinese characters and symbols; the 6763 characters inherited from GB2312 are usually called "common characters", while the characters in GBK but not in GB2312 are called "uncommon characters". Problem: still does not cover the whole Chinese writing system.
  • GB18030 (mixed 2-byte and 4-byte). The standard GB18030-2000, "Chinese Coded Character Set for Information Interchange, Extension for the Basic Set", issued in 2000. GB18030 is a very large coded character set built around Chinese characters that also includes several minority scripts of China (Tibetan, Mongolian, Dai, Yi, Korean, Uyghur, and others), covering more than 70,000 Chinese characters. Problem: because two bytes give too few code positions, GB18030 mixes 2-byte and 4-byte sequences, which complicates software support, so it has never been widely adopted.
  • BIG5 (double byte). The traditional-Chinese encoding scheme used in Taiwan and Hong Kong, planned and formulated by the Institute for Information Industry in 1984. It contains 13,053 Chinese characters, 408 other characters, and 33 control characters, and it was the de facto standard for early Chinese-language computing and the most widely used encoding in the Chinese-speaking community. Problem: its repertoire is limited.
  • Unification era: Unicode (originally double byte). A character encoding scheme developed by an international consortium to accommodate every script and symbol in the world, assigning each character in each language a single, unique code. Problem: in the original two-byte form, the high byte of every ASCII character is 0, which wastes space for the most common text.
  • UTF-8 (Unicode, variable length). The most widely used Unicode encoding; its defining feature is variable length, using 1 to 4 bytes per character depending on the code point, which makes it very space-efficient for English text. Problem: variable length makes indexing a string by subscript painful.
  • UTF-16 (Unicode, mostly fixed length). Each Unicode code point in the Basic Multilingual Plane is represented by one 16-bit unit (2 bytes); code points beyond it are represented by a pair of 16-bit units (4 bytes). Problem: there is no single byte-level representation of UTF-16 text, so its byte order differs across platforms, which complicates code portability.
  • UTF-32 (Unicode, four bytes). The length is always fixed: every Unicode code point is represented by 32 bits (4 bytes). Problem: a waste of space for English text.
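The length column above can be made concrete with a short sketch (the class name CharsetWidths is illustrative): measure how many bytes the same Chinese character 中 (U+4E2D) occupies under several of these encodings. GBK and UTF-32BE are named charsets that ship with standard JDKs.

```java
import java.nio.charset.Charset;

public class CharsetWidths {
    // Number of bytes a string occupies under the named encoding
    public static int width(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String han = "\u4E2D"; // 中
        System.out.println("GBK     : " + width(han, "GBK"));      // 2 bytes
        System.out.println("UTF-8   : " + width(han, "UTF-8"));    // 3 bytes
        System.out.println("UTF-16BE: " + width(han, "UTF-16BE")); // 2 bytes
        System.out.println("UTF-32BE: " + width(han, "UTF-32BE")); // 4 bytes
    }
}
```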

UTF-8 variable-length encoding rules

  • For a single-byte character, the first bit is set to 0 and the remaining 7 bits hold the character's Unicode code point. Characters 0–127 are therefore identical to ASCII, which means documents from the ASCII era open under UTF-8 without any problem.
  • For a character that needs N bytes (N > 1), the first N bits of the first byte are set to 1 and the (N+1)-th bit is set to 0; each of the remaining N − 1 bytes starts with the two bits 10. All remaining bits are filled with the character's Unicode code point.
Unicode range (hex)   UTF-8 representation (binary)
00000000 – 0000007F   0xxxxxxx
00000080 – 000007FF   110xxxxx 10xxxxxx
00000800 – 0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
00010000 – 0010FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
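The table above translates almost line by line into code. Below is a minimal hand-rolled encoder for a single code point (the class name Utf8Encode is illustrative; real code should simply call String.getBytes(StandardCharsets.UTF_8)):

```java
public class Utf8Encode {
    // Encode one Unicode code point to UTF-8 following the table above
    public static byte[] encode(int cp) {
        if (cp <= 0x7F) {
            return new byte[]{(byte) cp};                        // 0xxxxxxx
        } else if (cp <= 0x7FF) {
            return new byte[]{(byte) (0xC0 | (cp >> 6)),         // 110xxxxx
                              (byte) (0x80 | (cp & 0x3F))};      // 10xxxxxx
        } else if (cp <= 0xFFFF) {
            return new byte[]{(byte) (0xE0 | (cp >> 12)),        // 1110xxxx
                              (byte) (0x80 | ((cp >> 6) & 0x3F)),
                              (byte) (0x80 | (cp & 0x3F))};
        } else {                                                 // up to 0x10FFFF
            return new byte[]{(byte) (0xF0 | (cp >> 18)),        // 11110xxx
                              (byte) (0x80 | ((cp >> 12) & 0x3F)),
                              (byte) (0x80 | ((cp >> 6) & 0x3F)),
                              (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        // U+4E2D (中) falls in the 0800–FFFF row: 1110xxxx 10xxxxxx 10xxxxxx
        for (byte b : encode(0x4E2D)) System.out.printf("%02X ", b); // E4 B8 AD
        System.out.println();
    }
}
```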

BOM, little-endian, and big-endian

Since Unicode characters can occupy more than one byte, physical storage raises an ordering question: is the high-order byte stored at the lower or the higher memory address, and likewise for the low-order byte? The two answers give rise to the big-endian and little-endian byte orders:

  • big-endian: the high-order byte is stored at the lower memory address.
  • little-endian: the low-order byte is stored at the lower memory address.
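For example, under the two byte orders the code units of 中 (U+4E2D) come out as mirror images of each other; a minimal sketch using the JDK's fixed-order charsets:

```java
import java.nio.charset.StandardCharsets;

public class EndianDemo {
    public static void main(String[] args) {
        byte[] be = "\u4E2D".getBytes(StandardCharsets.UTF_16BE); // 中
        byte[] le = "\u4E2D".getBytes(StandardCharsets.UTF_16LE);
        // The high-order byte 4E comes first in big-endian, last in little-endian
        System.out.printf("big-endian   : %02X %02X%n", be[0], be[1]); // 4E 2D
        System.out.printf("little-endian: %02X %02X%n", le[0], le[1]); // 2D 4E
    }
}
```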

How does a computer know which byte order a certain file uses?

The Unicode specification defines a character that can be placed at the very front of a file to indicate its byte order: the "zero width no-break space", represented by FEFF. This is exactly two bytes, and FF is one greater than FE. If the first two bytes of a text file are FE FF, the file uses big-endian byte order; if they are FF FE, it uses little-endian byte order. This marker is the BOM (byte order mark).
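A minimal detection sketch along these lines (the class name BomSniff is illustrative; the UTF-8 BOM EF BB BF is included for completeness, even though byte order is not an issue for UTF-8):

```java
import java.nio.charset.StandardCharsets;

public class BomSniff {
    // Guess the byte order of a buffer from its leading bytes
    public static String sniff(byte[] b) {
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return "UTF-16 big-endian";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return "UTF-16 little-endian";
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF)
            return "UTF-8 with BOM";
        return "no BOM";
    }

    public static void main(String[] args) {
        // The JDK's "UTF-16" charset writes a big-endian BOM when encoding
        byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16);
        System.out.println(sniff(utf16)); // UTF-16 big-endian
    }
}
```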

Causes and solutions to garbled characters

Garbled text appears when a decoder meets bytes beyond its parsing range, that is, when text is decoded with a different encoding than the one it was written in. One defensive workaround is to convert characters to Unicode escape sequences (\uxxxx), which are pure ASCII and survive any of the encodings above:

/**
 * Get the Unicode escape sequence of every character in a string.
 * @param source the input string
 * @return the string rewritten as \uxxxx escapes
 */
public static String getUnicode(String source) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < source.length(); i++) {
        // charAt() yields one UTF-16 code unit; pad it to four hex digits
        result.append(String.format("\\u%04x", (int) source.charAt(i)));
    }
    return result.toString();
}
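To see the cause in action, here is a small round trip (the class name MojibakeDemo is illustrative): decoding UTF-8 bytes as Latin1 yields garbled text, and because Latin1 maps every byte value to some character, reversing the mistake recovers the original.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decode UTF-8 bytes with the wrong charset: classic garbling
    public static String garble(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverse the mistake: re-encode as Latin1, then decode as UTF-8
    public static String recover(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u7F16\u7801"; // 编码
        String garbled = garble(original);
        System.out.println(garbled);          // prints garbled Latin1 text
        System.out.println(recover(garbled)); // prints 编码 again
    }
}
```

The recovery only works because Latin1 decoding is lossless for every byte value; decoding the same bytes with a charset that replaces invalid sequences (such as GBK) can destroy the original bytes for good.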

A method that extracts complete characters one by one, keeping UTF-16 code units that form a single whole (surrogate pairs, and pairs of code points such as flag emoji) together:

public static void getComblineUnicode(String str) {
    List<String> nicknameUnicodes = new ArrayList<>();
    System.out.println("-------------------------");
    System.out.println("raw = " + str);

    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);

        if (Character.isHighSurrogate(c) && i + 1 < str.length()
                && Character.isLowSurrogate(str.charAt(i + 1))) {
            // Two code units (a surrogate pair) form one code point.
            // Some symbols, such as flag emoji, are a pair of regional
            // indicator code points: four code units forming one whole.
            if (i + 3 < str.length()
                    && Character.isHighSurrogate(str.charAt(i + 2))
                    && Character.isLowSurrogate(str.charAt(i + 3))
                    && isRegionalIndicator(str.codePointAt(i))
                    && isRegionalIndicator(str.codePointAt(i + 2))) {
                String quad = str.substring(i, i + 4);
                System.out.println("Four code units form a whole: " + quad);
                nicknameUnicodes.add(quad);
                i += 3; // skip the next three code units, already consumed
            } else {
                String pair = str.substring(i, i + 2);
                System.out.println("Two code units form a whole: " + pair);
                nicknameUnicodes.add(pair);
                i++; // skip the next code unit, already consumed
            }
        } else if (Character.isSurrogate(c)) {
            // An unpaired surrogate is not a valid character on its own
            System.out.println("Undetermined composition: " + c);
            nicknameUnicodes.add("*");
        } else {
            // A single code unit is a complete character
            System.out.println("Single code unit: " + c);
            nicknameUnicodes.add(String.valueOf(c));
        }
    }

    System.out.println("The first complete character: " + nicknameUnicodes.get(0));
    System.out.println("The last complete character: " + nicknameUnicodes.get(nicknameUnicodes.size() - 1));
}

// Regional indicator symbols U+1F1E6..U+1F1FF; two of them form a flag emoji
private static boolean isRegionalIndicator(int codePoint) {
    return codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF;
}
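The surrogate bookkeeping above can also be delegated to the JDK: codePointAt joins a surrogate pair into one code point, and charCount reports how many char units it used. Below is a minimal sketch (the class name CodePointSplit is illustrative; it does not group multi-code-point sequences such as flags, which need java.text.BreakIterator):

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointSplit {
    // Split a string into complete Unicode characters (code points)
    public static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);                  // joins surrogate pairs
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);               // 1 or 2 char units
        }
        return out;
    }

    public static void main(String[] args) {
        // "a" + 中 + 😀 (U+1F600, a surrogate pair in UTF-16)
        List<String> parts = split("a\u4E2D\uD83D\uDE00");
        System.out.println(parts.size());          // 3 characters
        System.out.println(parts.get(2).length()); // 2 char units, one character
    }
}
```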