Unicode, URL encoding, HTML entity symbol encoding

Character encoding

Inside a computer, all information is ultimately represented by binary values.
A character encoding is the specification that maps the characters of a natural language to the binary values a computer stores.

1. ASCII code

The ASCII code defines 128 character encodings, covering uppercase and lowercase English letters, digits, and some symbols. For example, the space character (SPACE) is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 control characters that cannot be printed: codes 0-31 and 127) occupy only the low 7 bits of a byte; the highest bit is always 0.
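The two values quoted above can be checked with a short Python sketch. The helper name `ascii_bits` is made up for this illustration; `ord` and `format` are standard built-ins.

```python
def ascii_bits(ch: str) -> str:
    """Return the ASCII code of ch as an 8-bit binary string."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "08b")  # zero-padded to a full byte


print(ord(" "), ascii_bits(" "))  # 32 00100000
print(ord("A"), ascii_bits("A"))  # 65 01000001
```

Every one of the 128 codes starts with a 0 bit, which is the property UTF-8 later relies on.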

2. Unicode and UTF-8

English can be encoded with 128 symbols, but 128 symbols are not enough to represent other languages. Chinese requires at least two bytes per character, and the encoding must not conflict with ASCII. China therefore formulated the GB2312 encoding, which uses two bytes to represent a Chinese character. In theory, two bytes can represent up to 256 x 256 = 65536 symbols.

But there are hundreds of languages in the world. Japan encodes Japanese with Shift_JIS, and South Korea encodes Korean with EUC-KR. Each country has its own standard, so conflicts are unavoidable, and the result is that text mixing several languages is displayed as garbled characters.
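A minimal sketch of how such garbling arises, assuming Python's built-in `gb2312` and `shift_jis` codecs: the same bytes are written under one national standard and read back under another. The sample string is chosen only for illustration.

```python
text = "中文"

raw = text.encode("gb2312")                        # bytes written by a GB2312 program
wrong = raw.decode("shift_jis", errors="replace")  # read back with the wrong standard
right = raw.decode("gb2312")                       # read back with the matching standard

print(len(raw), repr(wrong), repr(right))
```

The bytes themselves are unchanged; only the table used to interpret them differs, and that is enough to turn readable text into noise.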

Therefore, the Unicode character set came into being. Unicode unifies all languages into one set of codes, which removes this source of garbled characters. The Unicode standard is still evolving, but a commonly used encoding is UTF-16, which uses two bytes to represent most characters (very rare characters need 4 bytes).

The difference between ASCII encoding and Unicode encoding:

  • ASCII encoding is 1 byte per character, while Unicode encoded as UTF-16 is usually 2 bytes per character.

This solves the problem of garbled characters, but a new problem appears: if the text is almost entirely English, storing every character in two bytes requires twice as much storage space as ASCII.

Therefore UTF-8, a variable-length encoding of Unicode, came into being. The relationship is that UTF-8 is one of the ways to implement Unicode.

UTF-8 encodes a Unicode character into 1 to 4 bytes, depending on the size of its code number. Common English letters are encoded as 1 byte, Chinese characters are usually 3 bytes, and only very rare characters are encoded as 4 bytes. (The original UTF-8 design allowed sequences of up to 6 bytes, but the current standard limits it to 4.) If the text you want to transmit consists mostly of English characters, UTF-8 saves space.
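The space trade-off can be seen by comparing encoded lengths. This is a small sketch; the helper `sizes` is a made-up name, and the `utf-16-be` codec is used (an assumption of this example) so that no byte-order mark is counted.

```python
def sizes(s: str) -> tuple:
    """Return (UTF-8 length, UTF-16 length) of s in bytes."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-be"))


for sample in ["A", "hello", "严", "😀"]:
    utf8_len, utf16_len = sizes(sample)
    print(f"{sample!r}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

English text comes out half the size in UTF-8, while a Chinese character takes 3 bytes in UTF-8 against 2 in UTF-16.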

At present, computer systems generally handle character encoding as follows:

  • In computer memory, Unicode encoding is uniformly used, and when it needs to be saved to the hard disk or needs to be transmitted, it is converted to UTF-8 encoding.

  • When editing with Notepad, the UTF-8 characters read from the file are converted into Unicode characters and stored in the memory. After editing, the Unicode is converted to UTF-8 and saved to the file when saving.

  • When browsing the web, the server will convert the dynamically generated Unicode content into UTF-8 and then transmit it to the browser.
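The save-and-load cycle described in the list above can be sketched in Python, where a `str` holds Unicode characters in memory and the `encoding` argument of `open` performs the conversion. The file name `note.txt` and the sample text are arbitrary choices for this example.

```python
import os
import tempfile

text = "Unicode in memory: 严"  # a str holds Unicode code points

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")

    # Saving: Unicode characters are converted to UTF-8 bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # What is actually stored on disk.
    with open(path, "rb") as f:
        on_disk = f.read()

    # Loading: UTF-8 bytes are converted back to Unicode characters.
    with open(path, encoding="utf-8") as f:
        loaded = f.read()

print(on_disk)
print(loaded == text)
```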

UTF-8 encoding rules:

  1. For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits are the Unicode code of the symbol. So for English letters, UTF-8 encoding and ASCII encoding are the same.
  2. For a symbol of n bytes (n > 1), the first n bits of the first byte are all set to 1, bit n + 1 is set to 0, and the first two bits of each subsequent byte are set to 10. All the remaining bits hold the Unicode code of the symbol.

The following table summarizes the encoding rules. The letter x marks the bits available for the character's code.

Unicode symbol range (hex) | UTF-8 encoding method (binary)
---------------------------+-------------------------------------
0000 0000-0000 007F        | 0xxxxxxx
0000 0080-0000 07FF        | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF        | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF        | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

According to the above table, interpreting UTF-8 is very simple. If the first bit of a byte is 0, that byte alone is a character; if the first bit is 1, the number of consecutive 1s at the start of the byte tells how many bytes the current character occupies.
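The reading rule can be sketched as a function that counts the leading 1 bits of the first byte and then uses that count to split a byte string into characters. The function names are made up for this illustration, and the sketch does not validate continuation bytes.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte occupies."""
    if first_byte & 0b1000_0000 == 0:
        return 1  # leading 0: a single-byte (ASCII) character
    ones = 0
    mask = 0b1000_0000
    while first_byte & mask:  # count consecutive leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte, and more than four 1s is not valid
        raise ValueError("not a valid UTF-8 lead byte")
    return ones


def split_characters(data: bytes) -> list:
    """Split UTF-8 data into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


print(split_characters("A严".encode("utf-8")))  # [b'A', b'\xe4\xb8\xa5']
```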

Take the Chinese character 严 ("strict") as an example to demonstrate how UTF-8 encoding works:

The Unicode code of 严 is 4E25 (binary 100111000100101). According to the above table, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary bit of 严, fill in the x positions of the format from back to front, and fill any leftover x positions with 0. In this way, the UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
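The fill-from-the-back procedure can be written out as a small hand-rolled encoder. This is a sketch for illustration only (the name `utf8_encode` is made up, and surrogate code points are not rejected); in practice Python's built-in `str.encode("utf-8")` does this work.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the table of UTF-8 rules."""
    if code_point < 0x80:
        return bytes([code_point])  # single byte: 0xxxxxxx
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    elif code_point <= 0x10FFFF:
        n = 4
    else:
        raise ValueError("code point out of Unicode range")

    out = []
    # Fill continuation bytes from the back: 10xxxxxx, six payload bits each.
    for _ in range(n - 1):
        out.append(0b1000_0000 | (code_point & 0b0011_1111))
        code_point >>= 6
    # First byte: n leading 1 bits, then a 0, then the remaining payload bits.
    lead_prefix = (0xFF << (8 - n)) & 0xFF
    out.append(lead_prefix | code_point)
    return bytes(reversed(out))


encoded = utf8_encode(0x4E25)
print(encoded.hex().upper())                         # E4B8A5
print(" ".join(format(b, "08b") for b in encoded))   # 11100100 10111000 10100101
```

Comparing the result with `chr(0x4E25).encode("utf-8")` is a quick way to confirm the sketch agrees with the built-in encoder.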

3. URL encoding

URL encoding, also known as percent-encoding, is a context-specific encoding mechanism for Uniform Resource Locators.

URL encoding is a format used by browsers to package form input. The browser collects all the names and their values from the form, encodes them as name/value parameters (removing characters that cannot be transmitted, ordering the data, and so on) and sends them to the server either as part of the URL or separately. In either case, the form input arrives at the server in the following format:

theName=Ichabod+Crane&gender=male&status=missing&headless=yes

URL Encoding Rules:

  • Name/value pairs are separated from each other by the & character; within each pair, the name and the value are separated by the = character. If the user does not enter a value for a name, the name still appears, but without a value.

  • Any special character (anything outside 7-bit ASCII, such as Chinese characters) is encoded as hexadecimal digits preceded by the percent sign %. This also applies to characters that have a special meaning in the format itself, such as =, & and %. URL encoding is implemented with the UrlEncode function, which converts each character to be encoded into hexadecimal, takes up to 4 digits from right to left (values with fewer than 4 digits are used as they are), splits them into groups of 2 digits, and puts % in front of each group, giving the %XY format.

In fact, for an ASCII character, URL encoding is simply the character's ASCII code in hexadecimal with “%” added in front. For example, the ASCII code of “\” is 92, and 92 in hexadecimal is 5C, so the URL encoding of “\” is %5C.
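As a sketch of these rules, Python's standard `urllib.parse` module can stand in for the UrlEncode function mentioned above (an assumption of this example; other platforms provide their own equivalents). The form data mirrors the sample shown earlier.

```python
from urllib.parse import parse_qs, quote, urlencode

form = {
    "theName": "Ichabod Crane",
    "gender": "male",
    "status": "missing",
    "headless": "yes",
}

# Pairs are joined with &, names and values with =, and a space becomes +.
query = urlencode(form)
print(query)  # theName=Ichabod+Crane&gender=male&status=missing&headless=yes

# A name without a value still appears.
print(urlencode({"nickname": ""}))  # nickname=

# Special characters become % followed by two hexadecimal digits.
print(quote("\\", safe=""))      # %5C
print(quote("a=b&c%", safe=""))  # a%3Db%26c%25

# Non-ASCII characters are percent-encoded byte by byte (UTF-8 by default here).
print(quote("严"))  # %E4%B8%A5

# The server side reverses the process.
print(parse_qs(query)["theName"])  # ['Ichabod Crane']
```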

4. HTML Entity Symbol Encoding

In HTML, certain characters are reserved. For example, < and > cannot be used directly in text, because the browser would mistake them for tags. Reserved characters in HTML must be replaced with character entities.

  • An HTML entity is a piece of text (a string) that begins with an ampersand (&) and ends with a semicolon (;).

  • Entities are often used to display reserved characters (which would otherwise be parsed as HTML code) and invisible characters (such as the non-breaking space).

To put it simply, to keep the browser from confusing characters entered by the user with HTML syntax, any reserved character of the HTML syntax that should appear as text must be written as a character entity. For example, to display the less-than sign, you must write &lt; or &#60;.
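A short sketch of this replacement using Python's standard `html` module, chosen here only as a convenient illustration; the sample input string is made up.

```python
import html

user_input = '<b>5 < 6 & "quotes"</b>'

# Reserved characters are replaced with character entities before display.
safe = html.escape(user_input)
print(safe)  # &lt;b&gt;5 &lt; 6 &amp; &quot;quotes&quot;&lt;/b&gt;

# Entity names and entity numbers both decode to the same character.
print(html.unescape("&lt; and &#60;"))  # < and <

# Decoding the escaped text gives back the original input.
print(html.unescape(safe) == user_input)  # True
```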

Commonly used HTML character entities

Display result | Description          | Entity name                   | Entity number
---------------+----------------------+-------------------------------+--------------
               | non-breaking space   | &nbsp;                        | &#160;
<              | less-than sign       | &lt;                          | &#60;
>              | greater-than sign    | &gt;                          | &#62;
&              | ampersand            | &amp;                         | &#38;
"              | quotation mark       | &quot;                        | &#34;
'              | apostrophe           | &apos; (IE does not support)  | &#39;
¢              | cent                 | &cent;                        | &#162;
£              | pound                | &pound;                       | &#163;
¥              | RMB/JPY              | &yen;                         | &#165;
€              | euro                 | &euro;                        | &#8364;
§              | section              | &sect;                        | &#167;
©              | copyright            | &copy;                        | &#169;
®              | registered trademark | &reg;                         | &#174;
™              | trademark            | &trade;                       | &#8482;
×              | multiplication sign  | &times;                       | &#215;
÷              | division sign        | &divide;                      | &#247;

It should be noted that while HTML tag names are case-insensitive, entity names are case-sensitive.
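Case sensitivity can be observed with Python's standard `html` module, used here as an illustrative decoder rather than as a browser: two entity names that differ only in case decode to different characters. The numeric lookups use `html.entities.name2codepoint` from the standard library.

```python
import html
from html.entities import name2codepoint

upper = html.unescape("&Eacute;")
lower = html.unescape("&eacute;")
print(upper, lower)  # É é

# Entity names map to the entity numbers used in numeric references.
print(name2codepoint["nbsp"], name2codepoint["euro"], name2codepoint["trade"])  # 160 8364 8482
```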