How to Crawl Real-Time WebSocket Data with Python

1. Preface

As a crawler engineer, I often need to crawl real-time data at work, such as live sports scores, stock market quotes, or price movements in the cryptocurrency market.

On the Web, there are two ways to deliver ‘real-time’ data updates: polling and WebSocket. With polling, the client requests a server interface at a fixed interval (for example, every 1 second). The data appears to update in real time, but each update can lag by up to one interval, so it is not truly real time. Polling uses the pull model: the client actively pulls data from the server.
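To make the pull model concrete, here is a minimal polling sketch. The `poll` helper and the fake `fetch` callable are hypothetical stand-ins for illustration; a real crawler would probably make an HTTP request inside `fetch`.

```python
import time
from typing import Callable, List


def poll(fetch: Callable[[], str], interval: float, rounds: int) -> List[str]:
    """Call fetch() `rounds` times, sleeping `interval` seconds between calls."""
    results = []
    for i in range(rounds):
        results.append(fetch())   # the client pulls the current value
        if i < rounds - 1:
            time.sleep(interval)  # changes during this sleep are unseen until the next pull
    return results


# Hypothetical data source standing in for an HTTP endpoint.
prices = iter(["55.42", "55.59", "55.63"])
snapshots = poll(lambda: next(prices), interval=0.01, rounds=3)
print(snapshots)
```

Any value that changes twice within one interval is seen only once, which is why polling is only approximately real time.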

WebSocket uses the push model: the server actively pushes data to the client, so updates arrive in true real time.

2. What is WebSocket

WebSocket is a protocol for full-duplex communication over a single TCP connection. It simplifies data exchange between client and server and allows the server to actively push data to the client. With WebSocket, the browser and the server complete a single handshake, after which a persistent connection exists between the two for bidirectional data transfer.
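As a rough illustration of what "completing a handshake" involves, the sketch below computes the `Sec-WebSocket-Accept` value that a server returns for the client's `Sec-WebSocket-Key`, following RFC 6455. The function name `accept_key` is only illustrative, and the key used is the sample key from the RFC, not one taken from the target site.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for deriving Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key from RFC 6455.
print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))
```

A client library compares this value with the one in the server's response to confirm that the server really understood the WebSocket upgrade request.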

WebSocket Advantages

  • Less control overhead: The handshake and its request headers are sent only once; after that, only data is transmitted. Compared with HTTP, which carries request headers on every request, WebSocket saves considerable resources.
  • Stronger real-time performance: Because the server can push messages on its own initiative, latency is much lower than waiting for the next poll. Within a single HTTP polling interval, WebSocket can deliver multiple messages.
  • Binary support: WebSocket supports binary frames, which can make transfers more compact.
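To put a number on the "less control overhead" point, the following sketch computes the header size of a single WebSocket frame according to the framing rules in RFC 6455. The helper name `frame_header_size` is hypothetical.

```python
def frame_header_size(payload_len: int, masked: bool) -> int:
    """Header bytes of one WebSocket frame (RFC 6455, section 5.2)."""
    size = 2              # FIN/opcode byte + mask bit/7-bit length byte
    if payload_len > 65535:
        size += 8         # 64-bit extended payload length
    elif payload_len > 125:
        size += 2         # 16-bit extended payload length
    if masked:
        size += 4         # masking key, required on client-to-server frames
    return size


subscribe = '{"action":"subscribe","args":["QuoteBin5m:14"]}'
print(frame_header_size(len(subscribe.encode("utf-8")), masked=True))
```

A short client message therefore costs only a few bytes of framing, whereas an HTTP request typically repeats its full set of headers every time.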

Crawlers: HTTP versus WebSocket

Python has many network request libraries, and Requests is one of the most commonly used. It can simulate sending network requests, but those requests are based on the HTTP protocol. Requests is of no use against WebSocket; a library that can connect to WebSocket must be used instead.

3. Crawling ideas

Here is an example using real-time data from the Litecoin official website http://www.laiteb.com/. The WebSocket handshake occurs only once, so to observe it in the browser developer tools you need to open the developer tools while the page is open, switch to the Network tab, and then reload the page. You can then observe the WebSocket handshake request and the data transfer. Chrome is used as the example here:

The developer tools provide a filter; its WS option shows only the network requests for WebSocket connections.

The request list now contains a record named realTime. Left-click it and the developer tools split into two columns, with the details of this request shown on the right:

Unlike HTTP requests, WebSocket connection addresses begin with ws or wss. The status code for a successful connection is not 200, but 101.
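The ws/wss distinction can be checked programmatically before connecting. The sketch below uses only the standard library; `describe_ws_url` is a hypothetical helper, the default ports 80 and 443 are the ones WebSocket shares with HTTP and HTTPS, and the address is the one this article connects to later.

```python
from urllib.parse import urlsplit

# ws and wss default to the same ports as http and https.
DEFAULT_PORTS = {"ws": 80, "wss": 443}


def describe_ws_url(url: str) -> dict:
    """Split a WebSocket address into the pieces a client needs to connect."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError("not a WebSocket address: " + url)
    return {
        "secure": parts.scheme == "wss",  # wss means TLS, like https
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS[parts.scheme],
        "path": parts.path or "/",
    }


info = describe_ws_url("wss://api.bbxapp.vip/v1/ifcontract/realTime")
print(info)
```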

The Headers tab records Request and Response information, while the Frames tab records the data transmitted between the two parties, which is also the data content we need to crawl:

In the Frames tab, entries marked with an upward green arrow are sent by the client to the server, and entries marked with a downward orange arrow are pushed by the server to the client.

As can be seen from the data sequence, the client sends first:

{"action":"subscribe","args":["QuoteBin5m:14"]}

The server then pushes data continuously:

{"group":"QuoteBin5m:14","data":[{"low":"55.42","high":"55.63","open":"55.42","close":"55.59","last_price":"55.59","avg_price":"55.5111587372932781077","volume":"40078","timestamp":1551941701,"rise_fall_rate":"0.0030674846625766871","rise_fall_value":"0.17","base_coin_volume":"400.78","quote_coin_volume":"22247.7621987324"}]}
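Both messages are plain JSON, so they can be built and inspected with the standard `json` module. In this sketch, `build_subscribe` is a hypothetical helper, and `sample_push` is a shortened copy of the pushed message with only a few of its fields.

```python
import json


def build_subscribe(channel: str) -> str:
    """Serialize the subscription message in the compact form seen in the browser."""
    return json.dumps({"action": "subscribe", "args": [channel]}, separators=(",", ":"))


# Shortened copy of one pushed message.
sample_push = (
    '{"group":"QuoteBin5m:14","data":[{"low":"55.42","high":"55.63",'
    '"open":"55.42","close":"55.59","timestamp":1551941701}]}'
)

request = build_subscribe("QuoteBin5m:14")
message = json.loads(sample_push)
print(request)
print(message["group"], message["data"][0]["close"])
```

Note that the prices arrive as strings, so they need converting before any arithmetic.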


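Therefore, the entire process from initiating the handshake to obtaining data is: the client initiates the handshake, the server confirms it with status code 101, the client sends the subscription message, and the server then pushes data continuously.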

This raises several questions:

  • How is the handshake performed?
  • How is the connection kept alive?
  • How are messages sent and received?
  • Is there a library that makes all of this easy?

4. aiowebsocket

There are many Python libraries for connecting to WebSocket; the easy-to-use and stable ones include websocket-client (non-asynchronous), websockets (asynchronous), and aiowebsocket (asynchronous).

You can choose one of the three according to project requirements. Today we introduce the asynchronous WebSocket client aiowebsocket. Its GitHub address is: https://github.com/asyncins/aiowebsocket.

Its README says: AioWebSocket is an asynchronous WebSocket client that follows the WebSocket specification. It is lighter and faster than other libraries.

It installs like any other library: just run pip install aiowebsocket. After installation, we can test it with the sample code provided in the README:

import asyncio
import logging
from datetime import datetime
from aiowebsocket.converses import AioWebSocket


async def startup(uri):
    async with AioWebSocket(uri) as aws:
        converse = aws.manipulator
        message = b'AioWebSocket - Async WebSocket Client'
        while True:
            await converse.send(message)
            print('{time}-Client send: {message}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), message=message))
            mes = await converse.receive()
            print('{time}-Client receive: {rec}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))

if __name__ == '__main__':
    remote = 'ws://echo.websocket.org'
    try:
        asyncio.get_event_loop().run_until_complete(startup(remote))
    except KeyboardInterrupt as exc:
        logging.info('Quit.')

The result output after running is:

2019-03-07 15:43:55-Client send: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:55-Client receive: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:55-Client send: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:56-Client receive: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:56-Client send: b'AioWebSocket - Async WebSocket Client'
…

send is a message sent from the client to the server.

receive is a message pushed by the server to the client.

5. Writing code to obtain data

Back to the crawling task: the target website is the Litecoin official website.

From the network request record above, we know that the WebSocket address of the target website is wss://api.bbxapp.vip/v1/ifcontract/realTime. The address shows that the site uses wss, the secure version of ws; the two relate to each other the same way HTTPS relates to HTTP. aiowebsocket recognizes and handles SSL automatically, so no extra steps are needed. We only need to assign the target address to the connection uri:

import asyncio
import logging
from datetime import datetime
from aiowebsocket.converses import AioWebSocket


async def startup(uri):
    async with AioWebSocket(uri) as aws:
        converse = aws.manipulator
        while True:
            mes = await converse.receive()
            print('{time}-Client receive: {rec}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))

if __name__ == '__main__':
    remote = 'wss://api.bbxapp.vip/v1/ifcontract/realTime'
    try:
        asyncio.get_event_loop().run_until_complete(startup(remote))
    except KeyboardInterrupt as exc:
        logging.info('Quit.')

Run the code and observe the output: nothing happens. There is no output and no disconnection; the program simply keeps running:

Why is this?

Does the other party not accept our request?

Or are there any anti-crawler restrictions?
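While investigating a silent connection like this, it can help to put a time limit on each read so the program reports the silence instead of waiting forever. The sketch below uses `asyncio.wait_for` from the standard library; `receive_or_none` and `silent_server` are hypothetical names, and `silent_server` merely simulates a connection on which nothing arrives. With aiowebsocket, the assumption is that `converse.receive` could be passed in the same way.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def receive_or_none(
    receive: Callable[[], Awaitable[bytes]], timeout: float
) -> Optional[bytes]:
    """Wait for one message; return None if nothing arrives within `timeout` seconds."""
    try:
        return await asyncio.wait_for(receive(), timeout)
    except asyncio.TimeoutError:
        return None


async def silent_server() -> bytes:
    """Stand-in for a receive call on a connection that never sends anything."""
    await asyncio.sleep(3600)
    return b""


result = asyncio.run(receive_or_none(silent_server, timeout=0.05))
print(result)
```

A timeout only tells you that the server is silent, not why. Cancelling a read partway through a frame may also leave a real connection in an inconsistent state, so this is better suited to diagnosis than to production code.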

In fact, the process described earlier explains this problem:

One step in the process requires the client to send a specific message to the server; only after verifying it does the server start pushing data. The sending code therefore has to go after the handshake and before the first read:

import asyncio
import logging
from datetime import datetime
from aiowebsocket.converses import AioWebSocket


async def startup(uri):
    async with AioWebSocket(uri) as aws:
        converse = aws.manipulator
        # Client sends message to server
        await converse.send('{"action":"subscribe","args":["QuoteBin5m:14"]}')
        while True:
            mes = await converse.receive()
            print('{time}-Client receive: {rec}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))

if __name__ == '__main__':
    remote = 'wss://api.bbxapp.vip/v1/ifcontract/realTime'
    try:
        asyncio.get_event_loop().run_until_complete(startup(remote))
    except KeyboardInterrupt as exc:
        logging.info('Quit.')


After saving and running, you will see the data being pushed continuously:

At this point, the crawler can obtain the desired data.
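Once the messages arrive, a crawler usually converts them into typed records before storing them. The sketch below is one possible way to do that, assuming the pushed messages arrive as UTF-8 encoded bytes in the JSON format shown earlier; the `Bar` dataclass and `parse_push` function are hypothetical names, and only a few of the fields are kept.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Bar:
    group: str
    time: datetime
    open: float
    high: float
    low: float
    close: float


def parse_push(raw: bytes) -> List[Bar]:
    """Turn one pushed message into a list of typed records."""
    payload = json.loads(raw.decode("utf-8"))
    bars = []
    # Messages without a "data" list (if the server sends any) yield no records.
    for item in payload.get("data", []):
        bars.append(Bar(
            group=payload["group"],
            time=datetime.fromtimestamp(item["timestamp"], tz=timezone.utc),
            open=float(item["open"]),   # prices arrive as strings
            high=float(item["high"]),
            low=float(item["low"]),
            close=float(item["close"]),
        ))
    return bars


raw = (b'{"group":"QuoteBin5m:14","data":[{"low":"55.42","high":"55.63",'
       b'"open":"55.42","close":"55.59","timestamp":1551941701}]}')
bars = parse_push(raw)
print(bars[0])
```

Inside the receive loop, `parse_push(mes)` could replace the `print` call. If exact decimal precision matters, `decimal.Decimal` may be a better choice than `float`.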

What does aiowebsocket do

The code is not long. When using it, you only need to fill in the WebSocket address of the target website and then send the data according to the process. So what does aiowebsocket do in this process?

  • First, aiowebsocket sends a handshake request to the specified server based on the WebSocket address and verifies the handshake result.
  • Then, after confirming that the handshake is successful, the data is sent to the server.
  • To keep the connection open throughout the process, aiowebsocket automatically answers the server's ping frames with pong frames.
  • Finally, aiowebsocket reads the messages pushed by the server.
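To show what sending data and answering a ping look like on the wire, here is a sketch of client-side frame encoding based on RFC 6455. It is not aiowebsocket's actual code; `encode_client_frame` and `pong_for` are hypothetical names. A real client would pick a random masking key for each frame (for example with `os.urandom(4)`), whereas the fixed key below comes from the RFC's "Hello" example.

```python
import struct

OP_TEXT, OP_PING, OP_PONG = 0x1, 0x9, 0xA


def encode_client_frame(opcode: int, payload: bytes, mask_key: bytes) -> bytes:
    """Encode one unfragmented, masked client-to-server frame."""
    head = bytes([0x80 | opcode])  # FIN bit set + opcode
    n = len(payload)
    if n <= 125:
        head += bytes([0x80 | n])  # MASK bit set + 7-bit length
    elif n <= 0xFFFF:
        head += bytes([0x80 | 126]) + struct.pack("!H", n)
    else:
        head += bytes([0x80 | 127]) + struct.pack("!Q", n)
    # XOR every payload byte with the 4-byte masking key, cycling through the key.
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    return head + mask_key + masked


def pong_for(ping_payload: bytes, mask_key: bytes) -> bytes:
    """A pong echoes the payload of the ping it answers."""
    return encode_client_frame(OP_PONG, ping_payload, mask_key)


key = bytes.fromhex("37fa213d")
print(encode_client_frame(OP_TEXT, b"Hello", key).hex())
print(pong_for(b"", key).hex())
```

Control frames such as ping and pong carry at most 125 bytes of payload, so only the first length branch applies to them.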
