Encoding (ASCII code, URL encoding, HTML entity encoding, Unicode, UTF-8, HTTP status codes)

ASCII code

ASCII (American Standard Code for Information Interchange) is a computer character encoding based on the Latin alphabet, mainly used to represent modern English and other Western European languages. It is the most common information exchange standard and is equivalent to the international standard ISO/IEC 646.
ASCII codes use 7-bit or 8-bit binary numbers to represent 128 or 256 possible characters. Standard ASCII, also called basic ASCII, uses 7 bits (the remaining bit is 0) to represent all uppercase and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English. Of these, codes 0 to 31 and 127 (33 in total) are control characters or special communication characters; the rest are printable characters.
Control characters include LF (line feed), CR (carriage return), FF (form feed), DEL (delete), BS (backspace), and BEL (bell).
Special communication characters include SOH (start of heading), EOT (end of transmission), and ACK (acknowledge).
ASCII values 8, 9, 10, and 13 translate to backspace, tab, linefeed, and carriage return characters, respectively. They do not have a specific graphic display, but will have different effects on text display depending on the application.
32 to 126 (95 in total) are printable characters (32 is a space), and 48 to 57 are the ten Arabic numerals 0 to 9.
65-90 are 26 uppercase English letters, 97-122 are 26 lowercase English letters, and the rest are some punctuation marks, operators, etc.
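The ranges listed above can be checked with a short Python sketch. The function name `classify_ascii` and its category labels are made up for illustration; only `ord` and `chr` are standard built-ins.

```python
def classify_ascii(code: int) -> str:
    """Return a rough category for a 7-bit ASCII code (0-127)."""
    if not 0 <= code <= 127:
        raise ValueError("not a standard 7-bit ASCII code")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation"


# Count how many codes fall into the control group versus the printable group.
control_count = sum(classify_ascii(c) == "control" for c in range(128))
printable_count = 128 - control_count

print(ord("A"), chr(97))               # code of "A", character for code 97
print(control_count, printable_count)  # sizes of the two groups
```

`ord` maps a character to its code and `chr` maps a code back to its character, so the same two built-ins can be used to look up any entry in the ranges above.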
Also note that in standard ASCII the highest bit (b7) can be used as a parity bit. Parity checking is a method for detecting errors during transmission, and it comes in two forms: odd parity and even parity. Odd parity rule: the number of 1s in a correctly transmitted byte must be odd; if the 7 data bits contain an even number of 1s, the highest bit b7 is set to 1. Even parity rule: the number of 1s in a correctly transmitted byte must be even; if the 7 data bits contain an odd number of 1s, the highest bit b7 is set to 1.
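The parity rule can be sketched in a few lines of Python. This is only an illustration under the assumption that the parity bit lives in b7 of an 8-bit byte; the helper names `add_parity` and `check_parity` are hypothetical.

```python
def add_parity(code: int, odd: bool = False) -> int:
    """Return the 7-bit code with a parity bit placed in the highest bit (b7)."""
    if not 0 <= code <= 0x7F:
        raise ValueError("expected a 7-bit ASCII code")
    ones = bin(code).count("1")
    # Even parity: set b7 when the data bits hold an odd number of 1s.
    # Odd parity: set b7 when the data bits hold an even number of 1s.
    need_bit = (ones % 2 == 0) if odd else (ones % 2 == 1)
    return (code | 0x80) if need_bit else code


def check_parity(byte: int, odd: bool = False) -> bool:
    """Return True if the 8-bit byte satisfies the chosen parity rule."""
    ones = bin(byte & 0xFF).count("1")
    return ones % 2 == (1 if odd else 0)


even_a = add_parity(ord("A"))            # "A" has two 1 bits, so b7 stays 0
odd_a = add_parity(ord("A"), odd=True)   # odd parity needs one more 1 bit
even_c = add_parity(ord("C"))            # "C" has three 1 bits, so b7 is set
print(hex(even_a), hex(odd_a), hex(even_c))
```

Flipping any single bit of a byte produced by `add_parity` makes `check_parity` return False, which is how a receiver would detect a one-bit transmission error.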
Codes 128 to 255 are called extended ASCII. Many x86-based systems support extended (or “high”) ASCII, which uses the eighth bit of each character to define an additional 128 special symbols, foreign-language letters, and graphic symbols.

URL encoding

What is URL encoding
Encoding is the conversion of numbers and characters into a form a computer can store and display. Because a computer only recognizes 0 and 1 signals, any information entered into it must be converted into a string of 0s and 1s. In computing, the purpose of encoding is to store or transmit information. Because binary is cumbersome to read, people usually write encoded values in decimal or hexadecimal; the familiar ASCII codes, for example, are usually written in decimal.

URL encoding is a common encoding used by browsers. Originally it applied only to the URL itself, encoding certain characters within it. In practice, URL encoding is also used in the HTTP body, headers, and other parts of a request. When the URL path or query parameters contain Chinese characters or other special characters, they must be URL-encoded (as hexadecimal). The principle of URL encoding is to use safe characters to represent unsafe ones.
Safe characters are characters that have no special purpose or meaning.
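As a small illustration of representing unsafe characters with safe ones, the sketch below uses `quote` and `unquote` from Python's standard `urllib.parse` module. The sample strings are arbitrary, chosen only for the example.

```python
from urllib.parse import quote, unquote

# Characters with a special meaning in a URL are replaced by %XX sequences.
encoded_ascii = quote("a b&c=d")

# Non-ASCII text is first encoded as UTF-8, then each byte becomes %XX.
encoded_cjk = quote("严")

print(encoded_ascii)
print(encoded_cjk)

# unquote reverses the transformation.
decoded = unquote(encoded_cjk)
print(decoded)
```

By default `quote` leaves `/` unescaped because it is usually meant as a path separator; passing `safe=""` escapes it as well.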
How URLs are encoded

Components: a URL is made up of a few simple components, such as the protocol, domain name, port number, path, and query string, for example:
http://www.ccc.net/index?param=10
    **Protocol (scheme)** is the method the browser uses to request server resources; the http:// above indicates that the HTTP protocol is used. Among the protocols a browser supports, HTTP is the default: if you omit the protocol in the address bar and enter www.ccc.net directly, the browser accesses http://www.ccc.net.

    In today's applications, HTTP is generally considered insecure, so more and more websites are switching to HTTPS, the encrypted version of HTTP.

    **Host (host)** is the name of the website or server where the resource is located, also known as the domain name.

    The www.ccc.net above is the host; some hosts have no domain name, only an IP address. This also involves how DNS domain name resolution works, which was described in more detail earlier.

    **Port (port)**: several websites may be served under the same domain name and are distinguished by port. In effect, the visitor tells the server which port to connect to. The default port for HTTP is 80; if the port is omitted, the browser connects to port 80.

    **Path (path)** is the location of the resource on the website. The /index above is the path. Paths originally referred to real physical locations; servers can now simulate these locations, so paths today are virtual addresses. A path may consist only of directories, without a filename.

    **Query parameters (parameter)** are additional information provided to the server. There can be one or more sets of query parameters. Each parameter is a key-value pair, with a key name (key) and a value (value) joined by =.

    **Anchor (anchor)** is a position inside the web page, written as # followed by the anchor name and placed at the end of the URL, such as #anchor. After the browser loads the page, it automatically scrolls to the anchor's position.
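The components described above can be pulled apart with `urlsplit` and `parse_qs` from Python's standard `urllib.parse` module. The port `8080` and the `#anchor` fragment are added to the example URL purely for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Example URL extended with an explicit port and an anchor (assumed values).
url = "http://www.ccc.net:8080/index?param=10#anchor"
parts = urlsplit(url)

print(parts.scheme)    # protocol
print(parts.hostname)  # host
print(parts.port)      # port as an integer
print(parts.path)      # path
print(parts.query)     # raw query string
print(parts.fragment)  # anchor

# Query parameters are key-value pairs; each key maps to a list of values.
query = parse_qs(parts.query)

# When the URL has no explicit port, urlsplit reports None and the client
# falls back to the protocol's default (80 for HTTP).
default_port = urlsplit("http://www.ccc.net/index?param=10").port
```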

Reserved characters: URLs specify some characters with special meanings, which are often used to separate two different URL components. These characters are called reserved characters.
    **:** is used to separate the protocol and host components;
    **/** is used to separate the host and path;
    **?** is used to separate the path and query parameters;
    **=** is used to join the key and value in a query parameter;
    **&** is used to separate multiple key-value pairs in the query;
    Other reserved characters **. … # @ $ + ; %**

Character escaping: a URL escapes a character by prefixing its two-digit hexadecimal ASCII code with %.
    !: %21
    #: %23
    $: %24
    &: %26
    ': %27
    (: %28
    ): %29
    *: %2A
    +: %2B
    ,: %2C
    /: %2F
    :: %3A
    ;: %3B
    =: %3D
    ?: %3F
    @: %40
    [: %5B
    ]: %5D
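The escaping rule (a percent sign followed by the character's hexadecimal ASCII code) is simple enough to sketch by hand. The helper `percent_encode_char` and the constant `RESERVED` are hypothetical names; the cross-check uses the standard `urllib.parse.quote`.

```python
from urllib.parse import quote

# The reserved characters covered by the escaping rule.
RESERVED = "!#$&'()*+,/:;=?@[]"


def percent_encode_char(ch: str) -> str:
    """Escape one ASCII character as % plus two uppercase hex digits."""
    if len(ch) != 1 or ord(ch) > 0x7F:
        raise ValueError("expected a single ASCII character")
    return "%{:02X}".format(ord(ch))


table = {ch: percent_encode_char(ch) for ch in RESERVED}
for ch, escaped in table.items():
    print(ch, escaped)

# Cross-check against the standard library, with no character treated as safe.
matches_stdlib = all(quote(ch, safe="") == esc for ch, esc in table.items())
```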

URLs are divided into absolute URLs and relative URLs

Absolute URL: the URL alone is enough to determine the location of the resource. This means that the URL must carry complete information about the resource, including protocol, host, path, etc.
Relative URL: The URL does not contain all information about the location of the resource, and must be combined with the location of the current web page to locate the resource. (You can refer to the relative path of the file directory)

There are also two special abbreviations for URLs:

.: indicates the current directory, such as ./a.html, indicates the a.html file in the current directory
..: Indicates the upper-level directory, such as ../a.html (the a.html file in the upper-level directory)
Extended usage: ../../ indicates the upper two levels of directories
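How a relative URL is combined with the current page's location can be illustrated with `urljoin` from Python's standard `urllib.parse` module. The base URL below is an assumed example, not one taken from the text.

```python
from urllib.parse import urljoin

# Assumed location of the current web page.
base = "http://www.ccc.net/a/b/index.html"

same_dir = urljoin(base, "./a.html")         # current directory
parent_dir = urljoin(base, "../a.html")      # one level up
two_levels_up = urljoin(base, "../../a.html")  # two levels up

print(same_dir)
print(parent_dir)
print(two_levels_up)
```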

Unicode

The formal name associated with Unicode is “Universal Multiple-Octet Coded Character Set”, abbreviated UCS.
Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a unified and unique code to each character in each language, to meet the requirements of cross-language and cross-platform text conversion and processing. Unicode is a very large set: its code space can accommodate more than 1 million symbols, each with a different code. For example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严 (yan). For the full symbol tables, you can consult unicode.org or a dedicated Chinese character table.
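The three code points mentioned above can be inspected in Python, whose `str` type is a sequence of Unicode code points. This sketch uses only the built-ins `ord` and `chr` and the standard `unicodedata` module.

```python
import unicodedata

# Code point -> character.
ain = chr(0x0639)
capital_a = chr(0x0041)
yan = chr(0x4E25)

# Character -> code point, formatted in the usual U+XXXX notation.
notation = "U+{:04X}".format(ord(yan))

for ch in (ain, capital_a, yan):
    print("U+{:04X}".format(ord(ch)), unicodedata.name(ch))
```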

HTML encoding

HTML entity encoding refers to the escape sequences used in HTML.
· In HTML, some characters are reserved. For example, the less-than sign < and the greater-than sign > cannot be used directly in HTML text, because the browser will mistake them for tags.
· To display reserved characters correctly, we must use character entities in the HTML source code.
· A common character entity in HTML is the non-breaking space.
An entity is written as &name;, where name is the name of the character. Below are some of these special characters and their corresponding entities.

<: &lt;

>: &gt;

": &quot;

': &apos;

&: &amp;

©: &copy;

#: &num;

§: &sect;

¥: &yen;

$: &dollar;

£: &pound;

¢: &cent;

%: &percnt;

*: &ast;

@: &commat;

^: &Hat;

±: &plusmn;

Non-breaking space: &nbsp;
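Escaping reserved characters and decoding entities can be tried out with Python's standard `html` module. The sample markup string is arbitrary and serves only as an illustration.

```python
import html

text = '<a href="x">Tom & Jerry</a>'

# escape() replaces the reserved characters <, >, & and quotes with entities,
# so the string is displayed as text instead of being parsed as a tag.
escaped = html.escape(text)
print(escaped)

# unescape() turns named entities back into the characters they stand for.
restored = html.unescape(escaped)
symbols = html.unescape("&copy; &pound; &lt;&gt;")
nbsp = html.unescape("&nbsp;")  # the non-breaking space, U+00A0
print(restored, symbols)
```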

utf-8 encoding

The relationship here is that UTF-8 is one of the implementations of Unicode.

One of the biggest features of UTF-8 is that it is a variable-length encoding method. It can use 1~4 bytes to represent a symbol, and the byte length varies according to different symbols.

The encoding rules of UTF-8 are very simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the last 7 bits are the Unicode code of the symbol. So for English letters, UTF-8 encoding and ASCII encoding are the same.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are all set to 1, the (n+1)th bit is set to 0, and the first two bits of each following byte are set to 10. The remaining bits are filled with the Unicode code of the symbol.

The encoding rules are summarized in the table below, with the letter x denoting the available encoding bits.

Unicode symbol range (hex) | Decimal range | UTF-8 encoding (binary)
---------------------------+---------------+-------------------------------------
0000 0000-0000 007F        | 0-127         | 0xxxxxxx
0000 0080-0000 07FF        | 128-2047      | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF        | 2048-65535    | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF        | 65536-1114111 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
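The table translates almost directly into code. The function `utf8_encode` below is a hand-written sketch for illustration, not how Python implements UTF-8 internally; it skips details such as rejecting surrogate code points, and in practice `str.encode("utf-8")` should be used.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the bit patterns in the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


# Compare the sketch with the built-in encoder, one sample per table row.
samples = [0x41, 0x0639, 0x4E25, 0x1F600]
agrees_with_builtin = all(
    utf8_encode(cp) == chr(cp).encode("utf-8") for cp in samples
)
print(utf8_encode(0x4E25).hex())
```

The samples deliberately cover one code point from each row of the table, so each branch of the function is exercised once.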

HTTP status codes

There are four common classes of HTTP status codes:

Beginning with 2 (request successful): Indicates that the request was processed successfully.
Beginning with 3 (request redirected): Indicates that further action is required to complete the request. Typically, these status codes are used for redirection.
Beginning with 4 (request error): Indicates that there may be an error in the request, which prevents the server from processing it.
Beginning with 5 (server error): Indicates that the server encountered an internal error while attempting to process the request. These errors are usually a problem with the server itself, rather than with the request.
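Because the class is determined by the first digit, it can be derived with integer division. The function name `status_class` and the label strings are chosen only for this sketch.

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to one of the four common classes."""
    classes = {
        2: "request successful",
        3: "request redirected",
        4: "request error",
        5: "server error",
    }
    # code // 100 yields the first digit of a three-digit status code.
    return classes.get(code // 100, "other")


for code in (200, 302, 404, 503):
    print(code, status_class(code))
```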

Beginning with 2 (request successful)

200 (OK) The server has successfully processed the request. Usually, this means that the server served the requested web page.
201 (Created) The request was successful and the server created a new resource.
202 (Accepted) The server has accepted the request but has not yet processed it.
203 (Non-Authoritative Information) The server has successfully processed the request, but the returned information may have come from another source.
204 (No Content) The server successfully processed the request, but did not return any content.
205 (Reset Content) The server successfully processed the request and did not return any content. Unlike 204, this response asks the requester to reset the document view.
206 (Partial Content) The server successfully processed a partial GET request.

Beginning with 3 (request redirected)

300 (Multiple Choices) The server can perform several operations for the request. The server can choose one according to the requester (user agent), or provide a list of operations for the requester to choose from.
301 (Moved Permanently) The requested page has permanently moved to a new location. When the server returns this response (to a GET or HEAD request), the requester is automatically forwarded to the new location.
302 (Found, temporarily moved) The server is currently responding to the request with a page from a different location, but the requester should continue to use the original location for future requests.
303 (See Other) The server returns this code when the requester should make a separate GET request to a different location to retrieve the response.
304 (Not Modified) The requested page has not been modified since the last request. When the server returns this response, no page content is returned.
305 (Use Proxy) The requester can only access the requested page through a proxy. When the server returns this response, it also indicates the proxy the requester should use.
307 (Temporary Redirect) The server is currently responding to the request with a page from a different location, but the requester should continue to use the original location for future requests. Unlike 302, the request method must not change when the request is repeated at the new location.

Beginning with 4 (request error)

400 (Bad Request) The server did not understand the syntax of the request.
401 (Unauthorized) The request requires authentication. The server might return this response for web pages that require a login.
403 (Forbidden) The server rejected the request.
404 (Not Found) The server could not find the requested page.
405 (Method Not Allowed) The method specified in the request is not allowed.
406 (Not Acceptable) The server cannot respond with content that has the attributes requested.
407 (Proxy Authentication Required) This status code is similar to 401 (Unauthorized), but specifies that the requester must authenticate with a proxy.
408 (Request Timeout) The server timed out while waiting for the request.
409 (Conflict) The server encountered a conflict while completing the request. The server must include information about the conflict in the response.
410 (Gone) The server returns this response if the requested resource has been permanently deleted.
411 (Length Required) The server does not accept a request without a valid Content-Length header field.
412 (Precondition Failed) The server did not meet one of the preconditions set by the requester in the request.
413 (Request Entity Too Large) The server cannot process the request because the request entity is larger than the server is able to handle.
414 (Request-URI Too Long) The requested URI (usually a web address) is too long for the server to handle.
415 (Unsupported Media Type) The format of the request is not supported by the requested page.
416 (Requested Range Not Satisfiable) The server returns this status code if the page cannot provide the requested range.
417 (Expectation Failed) The server did not meet the requirements of the "Expect" request header field.

Beginning with 5 (server error)

500 (Internal Server Error) The server encountered an error and was unable to complete the request.
501 (Not Implemented) The server is not capable of fulfilling the request. For example, this code might be returned when the server does not recognize the request method.
502 (Bad Gateway) The server, acting as a gateway or proxy, received an invalid response from an upstream server.
503 (Service Unavailable) The server is currently unavailable (overloaded or down for maintenance). Usually, this is only a temporary state.
504 (Gateway Timeout) The server, acting as a gateway or proxy, did not receive a response from the upstream server in time.
505 (HTTP Version Not Supported) The server does not support the HTTP protocol version used in the request.
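For reference, Python's standard library ships the registered status codes as the `http.HTTPStatus` enum, which can be used to look up the official reason phrase for a code. The helper `is_known_status` is a hypothetical name used only in this sketch.

```python
from http import HTTPStatus

# Look up the official reason phrase for a few of the codes listed above.
for code in (200, 301, 404, 503):
    status = HTTPStatus(code)
    print(status.value, status.phrase)


def is_known_status(code: int) -> bool:
    """Return True if the code is a status code known to HTTPStatus."""
    try:
        HTTPStatus(code)
    except ValueError:
        return False
    return True
```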