Big data preprocessing and collection, experiment 3: urllib GET and POST requests (1)

Table of Contents

◆ Urllib basic operation - GET

● Output without utf-8 decoding

● Output after utf-8 decoding

● Timeout parameter: catching exceptions caused by connection timeouts

◆ Urllib basic operation - customized request header

● Adding multiple access parameters to the GET request

◆ Urllib basic operation - POST

● Crawling the Youdao Dictionary web page: the Headers of the captured packets record the data relevant to the request

● Viewing the parameters carried by the request

◆ Urllib3

● Urllib3 completes the GET request


Urllib basic operation - GET

●
First import the urllib module, define the URL you want to access, and use urlopen() to send a request to that URL.

●
The full signature of urlopen() is:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)


# Use urllib to obtain the source code of the Baidu homepage
import urllib.request

# 1. Define the url, i.e. the address to be accessed
url = 'http://www.baidu.com'

# 2. Simulate the browser sending a request to the server (requires an Internet connection)
response = urllib.request.urlopen(url)

# 3. Get the page source code from the response
# read() returns the body as bytes, and can only be called once per response
content = response.read()
print(content)

# To turn the bytes into a string, decode them: bytes --> str with decode('encoding format')
content = content.decode('utf-8')  # this step is very important


# 4. Print the data
print(content)

# Get the status code; 200 means the request succeeded
#print(response.getcode())

# Return the url address
#print(response.geturl())

# Get the response header information
#print(response.getheaders())
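The same GET request can also be written with a context manager, which closes the connection automatically. This is a minimal sketch using the same Baidu URL as above; the charset fallback to utf-8 is an assumption for pages that do not declare an encoding.

```python
import urllib.request

url = 'http://www.baidu.com'
# The with-statement closes the response when the block exits
with urllib.request.urlopen(url, timeout=10) as response:
    status = response.status  # same value as response.getcode()
    # Use the charset declared in the response headers, falling back to utf-8
    charset = response.headers.get_content_charset() or 'utf-8'
    content = response.read().decode(charset)

print(status)
print(len(content))
```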

● Output without utf-8 decoding

● Output after utf-8 decoding

● Timeout parameter: catching exceptions caused by connection timeouts

# timeout parameter: urlopen raises an exception if the server does not respond in time
import socket
import urllib.request
import urllib.error

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

# 0.1 s is almost always too short, so the timeout exception is caught here
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Urllib basic operation - customized request header

●
When crawling a web page, the output sometimes contains messages such as "Sorry, cannot be accessed", which means the crawl was rejected. Customizing the request headers is one way to solve this: it makes the request look to the web server as if it came from a normal browser rather than a crawler. Headers carry information about the request, the response, or the sending entity; if no custom request header is supplied, or the header is inconsistent with what the web page expects, the correct result may not be returned.

●
To obtain the Headers of a web page, open a URL (such as "http://www.baidu.com") in 360, Firefox or Google Chrome, right-click on the page and choose "Inspect" from the pop-up menu, then refresh the page. As shown in Figure 3-4, first click the "Network" tab, then "Doc", then the URL under "Name"; Headers information similar to the following will appear:

●
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36

●
Add an access parameter to the GET request: search for Beijing on Baidu and obtain the search results page.

●
Is it possible to directly define url = 'https://www.baidu.com/s?wd=Beijing'?

●
No: a URL may only contain ASCII characters by default, and the Chinese word 北京 ("Beijing") is not ASCII, so it must be percent-encoded before it can be put into the URL.

# Add an access parameter to the GET request
import urllib.request
import urllib.parse

# A search URL copied directly from the browser looks like:
# https://www.baidu.com/s?wd=Jay Chou

# Requirement: obtain the web page code of https://www.baidu.com/s?wd=Beijing
# Can we directly define url = 'https://www.baidu.com/s?wd=北京'?
# No: a URL may only contain ASCII, so the Chinese word must be percent-encoded first.

# Base url
url = "https://www.baidu.com/s?wd="
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36 Edg/103.0.1264.44'
}
# Use the quote method to percent-encode the non-ASCII search term (北京 = "Beijing")
name = urllib.parse.quote("北京")
# Assemble the url
url = url + name
# print(url)
# Customize the request object with the headers
request = urllib.request.Request(url=url, headers=headers)
# Send the request to the server
response = urllib.request.urlopen(request)
# Get the response information
content = response.read().decode('utf-8')
# Print the response information
print(content)

# When there is more than one parameter, the parameters are linked with the & symbol.
# Encoding each value with quote() and splicing them by hand works, but it is
# inefficient; urlencode() solves the multi-parameter problem in one call.
# urlencode requires the parameters to be given as a dictionary.

● Add multiple access parameters to the GET request

●
When there is more than one parameter, the encoded parameters are linked with the & symbol.

●
To simplify multi-parameter conversion and splicing, use urlencode, which takes the parameters as a dictionary of key-value pairs.
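To see concretely what these two helpers return, here is a short sketch comparing quote (one value) with urlencode (a whole dictionary); the sample parameters are the same ones used in the code below.

```python
import urllib.parse

# quote percent-encodes a single value (UTF-8 bytes -> %XX escapes)
print(urllib.parse.quote('北京'))      # %E5%8C%97%E4%BA%AC

# urlencode takes a dict, encodes every value, and joins pairs with &
params = {'go': 'Search', 'q': 'Beijing weather'}
print(urllib.parse.urlencode(params))  # go=Search&q=Beijing+weather
```

Note that urlencode encodes the space in "Beijing weather" as `+`, which is the conventional form for query strings.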

import urllib.request
import urllib.parse

url = 'https://cn.bing.com/search?'

data = {
    'go': 'Search',
    'q': 'Beijing weather'
}
new_data = urllib.parse.urlencode(data)
print(new_data)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    # Cookie copied from a browser session; yours will differ
    'Cookie': 'BIDUPSID=83261851D92939FFFF2D2C3800B6CCA2; 4753; BAIDUID=ED1F16239BBD2AB0CF8AF7923E3A68DE:FG=1; ispeed_lsm=2; W~1V9pbmcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACMONWMjDjVjb; BDUSS_BFESS=XVyNi1XcXZ- BDORZ=B490B5EBF6F3CD4 02E515D22BCDA1598; BA_HECTOR=8l0g0l2ga00h25a52g81dkhm1hk9pd81a; BAIDUID_BFESS=ED1F16239BBD2AB0CF8AF7923E3A68DE:FG=1; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; del Per=0; BD_CK_SAM=1; PSINO=7; ZFY=SR4hfozWRIXmU7ouv2ASem0KdSz0WImntiWy4T8Nftw:C; BD_HOME=1 ; baikeVisitId=53b5daaa-05ec-4fc4-b9d5-a54ea3e0658d; 37482_37497_26350_37365_37455; H_PS_645EC=878fjGnEi/QTHR5lTn8cql/qGCKSJk5xVRVe/WpoH2dRPvRJayxDhPJv8U3BoEGTXa+d; COOKIE_SESSION=1103_9_9_9_19_6_0_0 _9_2_0_0_2611_8863_3_0_1665474424_1665471813_1665474421|9#358_1132_1665459981|9; BDSVRTM=0'
}
url = url + new_data
print(url)
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)

Urllib basic operation - POST

●
An example of sending a POST request with the urllib.request module to obtain web page content.

●
The parameters of a POST request must be encoded first with urllib.parse.urlencode; the return value is a string.

●
The encoded result must also be converted to bytes: the data defined above is a string, but when sending the request the data must be of byte type (otherwise urlopen reports an error). Use data = bytes(data, 'utf-8') or data = data.encode('utf-8').

●
Unlike a GET request, the POST parameters are not spliced onto the url but placed inside the parameters of the request object.
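The two conversions mentioned above are equivalent; this small sketch (with a hypothetical single parameter q) shows that they produce identical bytes:

```python
import urllib.parse

# urlencode returns a str; the request body must be bytes
data = urllib.parse.urlencode({'q': 'Hello'})  # 'q=Hello'
as_bytes_1 = bytes(data, 'utf-8')
as_bytes_2 = data.encode('utf-8')
print(as_bytes_1 == as_bytes_2)  # True
print(as_bytes_1)                # b'q=Hello'
```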

● Crawling the Youdao Dictionary web page: the Headers of the captured packets record the data relevant to the request

●
Request link: https://dict.youdao.com/jsonapi_s?doctype=json&jsonversion=4

●
Request method: POST

import urllib.request
import urllib.parse

# The content you want to translate
content = 'Hello'  # later replace with input('Please enter the content you want to translate:')
url = 'https://dict.youdao.com/jsonapi_s?doctype=json&jsonversion=4'
headers = {
    "Cookie": '[email protected]; -ad-closed=1; ___rl__test__cookies=1649216072438',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36"
}
# Data carried by the POST request
data = {
    'q': content,
    'le': 'ja',
    't': '9',
    'client': 'web',
    'sign': '520a657bfae6f88b2deaa67067865128',
    'keyfrom': 'webdict',
}

# Encode the parameters, then convert the resulting string to bytes
data = urllib.parse.urlencode(data).encode('utf-8')
# Pass the headers to the Request object as well
req = urllib.request.Request(url, data=data, headers=headers)
res = urllib.request.urlopen(req)
html = res.read().decode('utf-8')
print('Translation result:\n', html)

● Viewing the parameters carried by the request

When the word we query differs, the sign parameter differs.

When the language of the query differs, the le parameter differs.

Urllib3

●
urllib3 is a powerful, well-organized HTTP client library for Python. Many parts of the Python ecosystem already use urllib3 internally. It provides many important features missing from the Python standard library, including: thread safety, connection pooling, client-side SSL/TLS verification, multipart file uploads, helpers for retrying requests and following HTTP redirects, support for compression encodings, support for HTTP and SOCKS proxies, and 100% test coverage.

●
Before using urllib3, open a cmd window and install it with the following command:

pip install urllib3

◆ Urllib3 completes the GET request

import urllib3

http = urllib3.PoolManager()
response = http.request('GET', 'http://www.baidu.com')
print(response.status)
print(response.data)
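As a sketch of what else request() can do beyond this basic call: it also accepts a fields parameter that encodes query parameters for you (replacing the manual urlencode step used earlier), and response.data is raw bytes that usually need decoding. This example assumes the httpbin.org echo service is reachable.

```python
import urllib3

http = urllib3.PoolManager()
# fields is urlencoded and appended to the URL automatically for GET requests
response = http.request('GET', 'http://httpbin.org/get', fields={'q': 'Beijing'})
print(response.status)
# response.data is bytes; httpbin echoes the query parameters back as JSON
print(response.data.decode('utf-8'))
```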