38 | WebSocket: TCP in the Sandbox

When I covered the TCP/IP protocol stack earlier, I mentioned the “TCP Socket”, which is really a programming interface through which applications use the TCP/IP stack to send and receive data at the transport layer.

So, did you know that there is another thing called “WebSocket”?

Judging from the name alone, “Web” refers to HTTP and “Socket” refers to socket calls. So what does combining the two mean?

As the name suggests, you can probably guess that “WebSocket” is a socket-style communication specification that runs on the “Web”, that is, on HTTP, and offers functionality similar to a “TCP Socket”. You can use it much like a “TCP Socket”, calling the protocol stack underneath to send and receive data freely.

To be more precise, “WebSocket” is a lightweight network communication protocol built on TCP, which puts it “on the same level” as HTTP.

Why WebSocket

However, there is already a widely used HTTP protocol, so why create another WebSocket? What are its benefits?

In fact, WebSocket, like HTTP/2, was born to solve certain deficiencies of HTTP. HTTP/2 targets “head-of-line blocking”, while WebSocket targets the “request-response” communication model.

So, what’s wrong with “request-response”?

“Request-response” is a “half-duplex” communication mode. Although data can be sent and received in both directions, it can flow in only one direction at a time, so transmission efficiency is low. More importantly, it is a “passive” communication mode: the server can only respond to the client’s requests and cannot actively send data to the client.

Although later HTTP/2 and HTTP/3 added features such as Stream and Server Push, “request-response” is still the main working method. This makes it difficult for HTTP to be used in areas that require “real-time communication” such as dynamic pages, instant messaging, and online games.

Before WebSocket appeared, developing real-time web applications with JavaScript in the browser was cumbersome. Because the browser is a “restricted sandbox” that cannot use TCP and only has HTTP available, many “workaround” techniques emerged. “Polling” is one of the more commonly used.

Simply put, polling means constantly sending HTTP requests to the server to ask if there is data. If there is data, the server will respond with a response message. If the polling frequency is relatively high, then the effect of “real-time communication” can be achieved approximately.

However, the disadvantages of polling are also obvious. Repeatedly sending invalid query requests consumes a lot of bandwidth and CPU resources, which is very uneconomical.
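To make the cost concrete, here is a minimal Python sketch that simulates polling instead of issuing real HTTP requests. The function name `simulate_polling`, the tick-based clock, and the sample message times are all assumptions made up for illustration.

```python
def simulate_polling(message_ticks, interval, duration):
    """Simulate a client that polls every `interval` ticks for `duration` ticks.

    `message_ticks` lists the ticks at which the server has new data.
    Returns (total_requests, empty_requests).
    """
    pending = sorted(message_ticks)
    total = empty = 0
    for now in range(0, duration, interval):
        total += 1
        # Everything produced up to this poll is returned in one response.
        ready = [t for t in pending if t <= now]
        pending = [t for t in pending if t > now]
        if not ready:
            empty += 1  # a wasted request: the server had nothing to say
    return total, empty


total, empty = simulate_polling(message_ticks=[3, 47], interval=1, duration=60)
print(f"{total} requests, {empty} returned nothing")
```

Polling less often (a larger `interval`) wastes fewer requests, but each message then waits longer before the client sees it, which is exactly the trade-off polling cannot escape.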

So, to overcome the shortcomings of HTTP’s “request-response” model, WebSocket “came into being”. It was originally part of HTML5 and later “stood on its own” as a separate standard, RFC 6455.

Features of WebSocket

WebSocket is a truly “full-duplex” communication protocol. As with TCP, both the client and the server can send data to each other at any time, instead of politely taking turns, “one from you, one from me”, as in HTTP. As a result, the server can become more “active”: once there is new data in the backend, it can “push” it to the client immediately, without waiting for the client to poll, which improves the efficiency of “real-time communication”.

WebSocket adopts a binary frame structure, and its syntax and semantics are completely incompatible with HTTP. However, because its main operating environment is the browser, in order to facilitate promotion and application, it has to “hitchhike” and try to move closer to HTTP in terms of usage habits. That’s what the “Web” in its name means.

For service discovery, WebSocket does not use TCP’s “IP address + port number”. Instead it keeps HTTP’s URI format, except that the scheme at the beginning is not “http”: two new names, “ws” and “wss”, denote the plaintext and encrypted WebSocket protocols respectively.

The default ports of WebSocket are also 80 and 443. Firewalls on the Internet today block most ports and only “let through” HTTP’s 80 and 443, so by “disguising” itself as HTTP, WebSocket can “penetrate” firewalls relatively easily and establish a connection with the server. The details of this “disguise” are covered later.

Here are a few examples of WebSocket URIs, which look almost exactly like HTTP URIs:

ws://www.chrono.com
ws://www.chrono.com:8080/srv
wss://www.chrono.com:445/im?user_id=xxx
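Because these URIs follow the HTTP format, a general-purpose URI parser can take them apart. The following Python sketch uses the standard library’s `urllib.parse.urlsplit`; the helper name `parse_ws_uri` and the shape of its return value are assumptions for illustration only.

```python
from urllib.parse import urlsplit

# Default ports mirror HTTP: 80 for plaintext "ws", 443 for encrypted "wss".
DEFAULT_PORTS = {"ws": 80, "wss": 443}


def parse_ws_uri(uri):
    """Split a WebSocket URI into (secure, host, port, resource)."""
    parts = urlsplit(uri)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"not a WebSocket URI: {uri!r}")
    # Fall back to the scheme's default port when none is given.
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    resource = parts.path or "/"
    if parts.query:
        resource += "?" + parts.query
    return parts.scheme == "wss", parts.hostname, port, resource


for uri in ("ws://www.chrono.com",
            "ws://www.chrono.com:8080/srv",
            "wss://www.chrono.com:445/im?user_id=xxx"):
    print(parse_ws_uri(uri))
```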

One thing to note is that the name WebSocket is easily misleading. Although in most cases we use WebSocket by calling an API in the browser, it is not a “collection of calling interfaces” but a communication protocol, so I think it is more appropriate to think of it as “TCP over Web”.

WebSocket frame structure

As mentioned just now, WebSocket also uses binary frames. With previous experience with HTTP/2 and HTTP/3, I believe you can quickly master the message structure of WebSocket this time.

However, WebSocket and HTTP/2 have different focuses. WebSocket focuses more on “real-time communication”, while HTTP/2 focuses more on improving transmission efficiency, so the frame structures of the two are also very different.

Although WebSocket has “frames”, it does not define “streams” as HTTP/2 does, so it has none of the complex features such as “multiplexing” and “priority”. And since it is “full-duplex” itself, it has no need for “server push” either. All in all, WebSocket frames are simpler to learn.

The following figure shows the WebSocket frame structure. The header length is not fixed: it is at least 2 bytes and at most 14 bytes. It looks complicated, but it is actually quite simple.

The first two bytes are necessary and the most critical.

The first bit of the first byte “FIN” is the flag bit for the end of the message, which is equivalent to “END_STREAM” in HTTP/2, indicating that the data has been sent. A message can be split into multiple frames. After the receiver sees “FIN”, it can put the previous frames together to form a complete message.

The three bits after “FIN” are reserved bits and currently have no meaning, but must be 0.

The last 4 bits of the first byte are very important and are called the “Opcode”. The opcode is in effect the frame type: for example, 1 means the frame content is plain text, 2 means it is binary data, 8 closes the connection, and 9 and 10 are PING and PONG, used to keep the connection alive.

The first bit of the second byte is the mask flag “MASK”, which indicates whether the frame content has been masked with a simple XOR operation. The current WebSocket standard stipulates that the client must mask the data it sends, and the server must not.

The last 7 bits of the second byte are “Payload len”, which gives the length of the frame content. It is another variable-length encoding: at minimum it occupies just these 7 bits, and at most 7 + 64 bits, that is, an additional 8 bytes. Because the specification requires the top bit of the 64-bit form to be 0, a single WebSocket frame can carry up to 2^63 - 1 bytes.
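The variable-length encoding is easier to see in code. RFC 6455 defines three forms: lengths up to 125 are stored directly in the 7 bits, the value 126 announces a 16-bit length in the next 2 bytes, and 127 announces a 64-bit length in the next 8 bytes. Below is a minimal Python sketch of the encoder; the function name `encode_payload_len` is made up for illustration.

```python
import struct


def encode_payload_len(length, masked=False):
    """Return the second header byte plus any extended length bytes."""
    mask_bit = 0x80 if masked else 0x00
    if length < 126:
        # Short form: the length fits in the 7-bit field itself.
        return bytes([mask_bit | length])
    if length < 1 << 16:
        # 126 signals that a 16-bit big-endian length follows.
        return bytes([mask_bit | 126]) + struct.pack("!H", length)
    if length < 1 << 63:
        # 127 signals that a 64-bit big-endian length follows.
        return bytes([mask_bit | 127]) + struct.pack("!Q", length)
    raise ValueError("payload too large for a single frame")


for n in (5, 126, 65536):
    print(n, encode_payload_len(n).hex())
```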

The length field is followed by the “Masking-key”, whose presence is determined by the “MASK” flag above. If masking is used, it is a 4-byte random number; otherwise it is absent.

After this analysis, you can see that the WebSocket frame header really has just four parts: “end flag + opcode + frame length + mask”. It merely plays a “little trick” with variable-length encoding, which makes it less simple and clear than HTTP/2’s fixed-length header.
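Putting the four parts together, a header decoder might look like the Python sketch below. It assumes the buffer already holds a complete header and that no extensions are in use (so the reserved bits must be 0); the names `FrameHeader` and `parse_frame_header` are hypothetical, and the sample bytes are the masked “Hello” frame given as an example in RFC 6455.

```python
import struct
from typing import NamedTuple, Optional


class FrameHeader(NamedTuple):
    fin: bool
    opcode: int
    masked: bool
    payload_len: int
    masking_key: Optional[bytes]
    header_size: int  # 2 to 14 bytes


def parse_frame_header(data: bytes) -> FrameHeader:
    """Decode the header at the start of `data` (assumed to be complete)."""
    first, second = data[0], data[1]
    fin = bool(first & 0x80)          # top bit of byte 0
    if first & 0x70:
        raise ValueError("reserved bits must be 0")
    opcode = first & 0x0F             # low 4 bits of byte 0
    masked = bool(second & 0x80)      # top bit of byte 1
    length = second & 0x7F            # low 7 bits of byte 1
    offset = 2
    if length == 126:                 # 16-bit extended length follows
        (length,) = struct.unpack_from("!H", data, offset)
        offset += 2
    elif length == 127:               # 64-bit extended length follows
        (length,) = struct.unpack_from("!Q", data, offset)
        offset += 8
    key = None
    if masked:                        # 4-byte key only when MASK is set
        key = data[offset:offset + 4]
        offset += 4
    return FrameHeader(fin, opcode, masked, length, key, offset)


print(parse_frame_header(bytes.fromhex("818537fa213d7f9f4d5158")))
```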

Our experimental environment uses OpenResty’s “lua-resty-websocket” library to implement simple WebSocket communication. You can visit the URI “/38-1”, which connects to the back-end WebSocket service “ws://127.0.0.1/38-0”, and capture packets with Wireshark to see the entire WebSocket communication process.

The screenshot below shows one of the text frames. Because it is sent by the client, it must be masked, so the header carries a 4-byte “Masking-key” in addition to the first two bytes, for a total of 6 bytes.

The message content is masked, so it is not directly readable as plain text, but the security strength of the mask is almost zero: XORing the content with the “Masking-key” is all it takes to recover the plain text.
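A few lines of Python are enough to show how weak the mask is. In this sketch each payload byte is XORed with the key byte at position i mod 4, as RFC 6455 specifies; the function name `apply_mask` is made up, and the key and masked bytes are taken from the “Hello” example in the RFC rather than from the capture described here.

```python
def apply_mask(payload: bytes, masking_key: bytes) -> bytes:
    """XOR each payload byte with the key byte at position i mod 4."""
    return bytes(b ^ masking_key[i % 4] for i, b in enumerate(payload))


key = bytes.fromhex("37fa213d")
masked = bytes.fromhex("7f9f4d5158")
print(apply_mask(masked, key))
```

Because XOR is its own inverse, the same function both masks and unmasks.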

WebSocket handshake

Like TCP and TLS, WebSocket also requires a handshake process before data can be officially sent and received.

Here WebSocket again takes a “free ride” on HTTP: it uses HTTP’s own “protocol upgrade” feature to “disguise” itself as HTTP, so that it can get past browser sandboxes, network firewalls, and the like. This is another important point of connection between WebSocket and HTTP.

The WebSocket handshake is a standard HTTP GET request, but with two dedicated header fields for protocol upgrades:

“Connection: Upgrade” means that the protocol is required to be “upgraded”;

“Upgrade: websocket” means to “upgrade” to the WebSocket protocol.

In addition, in order to prevent ordinary HTTP messages from being “accidentally” recognized as WebSocket, the handshake message also adds two additional authentication header fields (the so-called “Challenge”):

Sec-WebSocket-Key: a Base64-encoded 16-byte random number, used as a simple authentication key;

Sec-WebSocket-Version: The version number of the protocol, currently it must be 13.
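As a rough illustration of what such a request looks like on the wire, the Python sketch below assembles the GET request with the four fields described above. The function name `build_handshake_request` and its parameters are assumptions, and a real client would usually send more headers (for example “Origin” in a browser).

```python
import base64
import os


def build_handshake_request(host, resource="/"):
    """Return (request_text, key) for a WebSocket opening handshake."""
    # 16 random bytes, Base64-encoded, become Sec-WebSocket-Key.
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {resource} HTTP/1.1",
        f"Host: {host}",
        "Connection: Upgrade",
        "Upgrade: websocket",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "", "",  # blank line terminates the header block
    ]
    return "\r\n".join(lines), key


request, key = build_handshake_request("127.0.0.1", "/38-0")
print(request)
```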

When the server receives the HTTP request message and sees the four fields above, it knows that this is not an ordinary GET request but a WebSocket upgrade request. So it does not follow the ordinary HTTP processing flow, but constructs a special “101 Switching Protocols” response message, notifying the client that from now on HTTP will no longer be used and communication switches to the WebSocket protocol. (This is somewhat like TLS’s “Change Cipher Spec”.)
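The server-side check can be sketched in Python as follows. This is only an illustration under stated assumptions: `is_websocket_upgrade` is a hypothetical helper, and `headers` is assumed to be a plain dict whose field names have already been lower-cased by whatever HTTP parser sits in front of it.

```python
def is_websocket_upgrade(method, headers):
    """Check the four handshake fields on an already-parsed HTTP request."""
    # "Connection" may carry several comma-separated tokens.
    connection = [t.strip().lower()
                  for t in headers.get("connection", "").split(",")]
    return (
        method == "GET"
        and "upgrade" in connection
        and headers.get("upgrade", "").lower() == "websocket"
        and bool(headers.get("sec-websocket-key"))
        and headers.get("sec-websocket-version") == "13"
    )


print(is_websocket_upgrade("GET", {
    "connection": "Upgrade",
    "upgrade": "websocket",
    "sec-websocket-key": "dGhlIHNhbXBsZSBub25jZQ==",
    "sec-websocket-version": "13",
}))
```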

WebSocket’s handshake response message also has a special format. The field “Sec-WebSocket-Accept” must be used to verify the client request message, also to prevent incorrect connections.

The specific method is to append a dedicated UUID, “258EAFA5-E914-47DA-95CA-C5AB0DC85B11”, to the value of “Sec-WebSocket-Key” from the request header, calculate the SHA-1 digest of the result, and then Base64-encode it:

encode_base64(
  sha1(
    Sec-WebSocket-Key + '258EAFA5-E914-47DA-95CA-C5AB0DC85B11' ))

When the client receives the response message, it runs the same algorithm and compares the result with the value of “Sec-WebSocket-Accept”. If they are equal, the response really did come from the server that received its handshake request, and the authentication succeeds.
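The same calculation can be written as a runnable Python sketch using only the standard library. The function name `compute_accept` is made up for illustration, and the sample key is the one used as an example in RFC 6455, not a value from the experimental environment.

```python
import base64
import hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept(sec_websocket_key: str) -> str:
    """Base64 of the SHA-1 digest of the key concatenated with the fixed UUID."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


print(compute_accept("dGhlIHNhbXBsZSBub25jZQ=="))
```

Both sides call the same function: the server to fill in “Sec-WebSocket-Accept”, the client to check it.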

After the handshake is completed, the subsequent data transmitted is no longer an HTTP message, but a binary frame in WebSocket format.

Summary

The browser is a “sandbox” environment with many restrictions and does not allow the establishment of a TCP connection to send and receive data. With WebSocket, we can directly establish a “TCP connection” with the server in the browser and gain more freedom.

However, freedom also comes at a price. Although WebSocket sits at the application layer, it is used much like a “TCP Socket” and is rather “primitive”: users must manage connections, caching, and state themselves, and development is much more complicated than with HTTP. So whether to introduce WebSocket into your project is a decision that must be considered carefully.

The “request-response” mode of HTTP is not suitable for developing “real-time communication” applications. It is inefficient and difficult to implement dynamic pages, so WebSocket appeared;

WebSocket is a “full-duplex” communication protocol, which is equivalent to a “thin wrapper” for TCP, allowing it to run in a browser environment;

WebSocket uses HTTP-compatible URIs to discover services, but new protocol names “ws” and “wss” are defined, and the port numbers continue to be 80 and 443;

WebSocket uses binary frames and has a relatively simple structure. The special thing is that there is a “mask” operation. The client must mask the data sent, but the server does not;

WebSocket uses HTTP to implement its connection handshake: the client sends a GET request asking for a “protocol upgrade”, and a very simple authentication mechanism in the handshake guards against mistaken connections.