requests module and urllib module

Article directory

  • urllib module
    • Understanding urllib (C1)
    • Details of urllib.request module
      • Common methods (C21)
      • UA format and how to obtain it (C21)
      • urllib web page request operation (C31)
    • Details of urllib.parse module
      • Common methods (C21)
      • urlencode and quote in practice (C31)
    • urllib image-saving example (C1)
  • Requests module
    • Introduction to Requests (C1)
    • pip install from third-party mirror sites (C4)
    • Use of Requests
      • Common methods of Requests (C21)
      • Common parameters of Requests (C21)
      • Requests response content (C21)
      • Parameter handling of get requests in Requests
        • Method 1: add parameters to the url link (C31)
        • Method 2: pass parameters via params (C31)
        • Extension: quickly convert headers parameters into dictionary data (C31)
      • post requests in Requests
        • 360 translation example (C31)
  • Practical operation of Cookie and Session
    • Concept
      • Cookie (C1)
      • Session (C1)
      • Login process
      • What are the benefits of doing this?
      • Cookie and Session Practical Operation (C31)

[Mind Map]: https://www.zhixi.com/view/96e707b1

urllib module

Understanding urllib (C1)

The urllib library is Python's built-in HTTP request library. The high-level interface provided by the urllib module makes accessing data on the web and on FTP servers as simple as accessing local files. It contains the following submodules:

urllib.error Exception handling module
urllib.parse url parsing module
urllib.robotparser robots.txt parsing module
urllib.request Request module
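The urllib.error module listed above is not demonstrated later in this article; as a minimal sketch (the target address 127.0.0.1 port 1 is chosen only because almost nothing listens there, so the request fails without needing the internet), exceptions raised by urlopen can be caught like this:

```python
import urllib.request
import urllib.error

def fetch(url):
    """Return the page body, or an error description if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=3) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (404, 500, ...)
        return f"HTTP error: {e.code}"
    except urllib.error.URLError as e:
        # Network-level failure: DNS error, refused connection, timeout, ...
        return f"URL error: {e.reason}"

# Port 1 is almost never listening, so this fails locally and is caught as URLError
print(fetch("http://127.0.0.1:1/"))
```

Note that HTTPError is a subclass of URLError, so it must be caught first.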

Details of urllib.request module

Common methods (C21)

urllib.request.urlopen("URL" / Request object)  sends a request to the website and gets a response; urlopen() does not support customizing the User-Agent
urllib.request.Request("url", headers=dictionary)  returns a Request object carrying the request headers
read()  reads the content of the server response
    byte stream: response.read()    string: response.read().decode("utf-8")
getcode()  returns the HTTP response code
geturl()  returns the URL of the actual data (guards against redirection problems)

UA format and how to obtain it (C21)

UA is passed in dictionary format: header = {"User-Agent": "UA string"}

How to obtain it: open the web page → Inspect → Network → refresh the page → select the page request → Headers → scroll down to find the User-Agent

urllib web page request operation (C31)

import urllib.request

# Construct a Request object req that carries the request headers.
# If you don't need custom headers, skip urllib.request.Request(url, headers=header) and pass the url straight to urlopen
url = "https://www.baidu.com"

# UA format: {"User-Agent": "UA string"}
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
req = urllib.request.Request(url, headers=header)

# send the request
response = urllib.request.urlopen(req)

# get the response and convert it into readable data
html1 = response.read()  # get the byte stream (note: the response body can only be read once)
print("html1:", html1)

html2 = html1.decode()  # convert the byte stream to a string
print("html2:", html2)

# get other attributes (optional)
data1 = response.getcode()  # get the HTTP response code
data2 = response.geturl()  # get the URL of the actual data

Details of urllib.parse module

The parse module provides methods for escaping Chinese characters into % + hexadecimal (percent-encoded) form.

Common methods (C21)

  • urlencode(dictionary)
  • quote(string) (its parameter is a string)
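As a quick offline check of the two methods (using the Chinese title of One Piece, 海贼王, as the sample string):

```python
from urllib.parse import urlencode, quote

# urlencode takes a dictionary and returns a full key=value query string
print(urlencode({"wd": "海贼王"}))  # wd=%E6%B5%B7%E8%B4%BC%E7%8E%8B

# quote takes a plain string and escapes only the value itself
print(quote("海贼王"))  # %E6%B5%B7%E8%B4%BC%E7%8E%8B
```

Each Chinese character becomes three %XX escapes because its UTF-8 encoding is three bytes long.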

urlencode and quote in practice (C31)

import urllib.request
import urllib.parse

url1 = "https://www.baidu.com/s?wd=海贼王"  # wd = "One Piece" in Chinese
url3 = "https://www.baidu.com/s?wd=python"  # pure-ASCII urls need no escaping

# When the url contains Chinese characters, urlopen reports an error, for example:
# "UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)"
# req = urllib.request.urlopen(url1)  # uncomment to reproduce the error
# print(req)

# The solution is to escape Chinese characters into % + hexadecimal form; urllib.parse provides two methods for this

# Method 1: urllib.parse.urlencode(data in dictionary form)
r = {"wd": "海贼王"}
r_remake = urllib.parse.urlencode(r)  # percent-encode r
print(r_remake)  # the return value is a query string, not a dictionary
url = "https://www.baidu.com/s?" + r_remake  # rebuild the url
response1 = urllib.request.urlopen(url)  # make the request

# Method 2: urllib.parse.quote(string data)
r = "海贼王"
r_remake = urllib.parse.quote(r)
print(r_remake)
url = "https://www.baidu.com/s?wd=" + r_remake
response2 = urllib.request.urlopen(url)

urllib image-saving example (C1)

from urllib.request import urlretrieve

# urllib.request.urlretrieve method
"""Parameter description:
url: remote or local url
filename: the local path to save to (if omitted, urllib generates a temporary file to hold the data)
reporthook: a callback function that can be used to display the current download progress
data: the data to post to the server
The method returns a tuple (filename, headers): filename is the local save path, headers is the server's response headers.
"""
url = "https://img0.baidu.com/it/u=1687867,2906233533&fm=253&app=138&size=w931&n=0&f=JPEG&fmt=auto?sec=1689526800&t=45c19714c9e59e9fac0a3f3f43df2fbe"
urlretrieve(url, filename="a picture.jpg")

# requests method
import requests
url = "https://img1.baidu.com/it/u=1597761366,2823600315&fm=253&fmt=auto&app=138&f=JPEG?w=889&h=500"
response = requests.get(url)
with open("girl.jpg", mode="wb") as f:
    f.write(response.content)

Requests module

Introduction to Requests (C1)

Requests is an HTTP library written in Python, based on urllib and released under the Apache2 License. It is more convenient than urllib, saves us a lot of work, and fully meets the needs of HTTP testing. The philosophy of Requests is built around the idioms of PEP 20, so it is more Pythonic than urllib. More importantly, it supports Python 3.

pip install from third-party mirror sites (C4)

pip install requests
# If the download times out, just switch to a mirror source

# example
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple/

# Aliyun http://mirrors.aliyun.com/pypi/simple/
# Douban http://pypi.douban.com/simple/
# Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/
# University of Science and Technology of China http://pypi.mirrors.ustc.edu.cn/simple/
# Huazhong University of Science and Technology http://pypi.hustunique.com/

Use of Requests

Common methods of Requests (C21)

Which method to use depends on what the web page requires; check it under Network → Headers in the browser's developer tools.

  • requests.get("URL")
  • requests.post("URL")
  • put, delete, etc. are less commonly used

Common parameters of Requests (C21)

url       the requested URL; the interface request address given in the interface documentation
params    the parameters of the requested url can be passed here
data      request data, in form-parameter format
json      a common request-data format for interfaces
headers   request header information
cookies   saves user login information; e.g. a recharge feature requires the user to already be logged in, so the cookie information must be sent along with the request
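A network-free way to see how these parameters are assembled is to build a request without sending it. The sketch below (the endpoint and parameter values are made up for illustration) uses requests.Request together with .prepare():

```python
import requests

# Build a GET request object without sending it
req = requests.Request(
    "GET",
    "https://www.example.com/s",  # hypothetical endpoint
    params={"wd": "python", "ie": "utf-8"},
    headers={"User-Agent": "my-spider/1.0"},
)
prepared = req.prepare()  # merges params into the URL, fills in headers

print(prepared.url)     # https://www.example.com/s?wd=python&ie=utf-8
print(prepared.method)  # GET
```

requests.get(url, params=...) does exactly this assembly internally before sending.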

Requests response content (C21)

r.encoding            gets the current encoding; it can also be found in the page source, but checking it here is easier
r.encoding = 'utf-8'  sets the encoding format directly
r.text                parses and returns the content as a string using that encoding
r.cookies             returns the cookies
r.headers             the server response headers stored in a dictionary-like object; this dictionary is special in that its keys are case-insensitive, and a missing key returns None
r.status_code         the response status code
r.json()              Requests' built-in JSON decoder; returns the content parsed as json, provided the response body really is json, otherwise parsing fails and an exception is raised
r.content             the response body as raw bytes (binary); gzip and deflate compression are decoded automatically

Parameter handling of get requests in Requests

Method 1: add parameters to the url link (C31)

Note: the interface and its parameters are joined with ?, each parameter is written as key=value, and multiple parameters are joined with the & symbol.

import requests

url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beauty&oq=%E5%9B%BE%E7%89%87%E9%AB%98%E6%B8%85&rsv_pq=bed3e6b1006dacaa&rsv_t=64d7QQCXzzGa7ZaT7lTtLojRuyL3gMlPYPnafKNs/xgiDAE8Y+Z1rNb9AsU&rqlang=cn&rsv_dl=tb&rsv_enter=1&rsv_btype=t&inputT=1775&rsv_sug3=28&rsv_sug1=21&rsv_sug7=100&rsv_sug2=0&rsv_sug4=1775"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}

html = requests.get(url=url, headers=headers).text  # .text converts the response into readable text
print(html)

Method 2: pass parameters via params (C31)

Note: params must be dictionary data, i.e. it must also consist of key-value pairs.

import requests

url = "https://www.baidu.com/s?"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
params = {
    "ie": "utf-8",
    "f": "8",
    "rsv_bp": "1",
    "tn": "baidu",
    "wd": "Beauty",
    "oq": "Picture HD",
    "rsv_pq": "bed3e6b1006dacaa",
    "rsv_t": "64d7QQCXzzGa7ZaT7lTtLojRuyL3gMlPYPnafKNs/xgiDAE8Y+Z1rNb9AsU",
    "rqlang": "cn",
    "rsv_dl": "tb",
    "rsv_enter": "1",
    "rsv_btype": "t",
    "inputT": "1775",
    "rsv_sug3": "28",
    "rsv_sug1": "21",
    "rsv_sug7": "100",
    "rsv_sug2": "0",
    "rsv_sug4": "1775",
}

html = requests.get(url, headers=headers, params=params).text
print(html)

Extension: quickly convert headers parameters into dictionary data (C31)

Use a regex find-and-replace in the editor:

1 Press Ctrl + R to open replace

2 Search:  (.*):\s(.*)$
  Replace: "$1": "$2", (note the trailing comma)

3 Tick the Regex checkbox

4 Replace all
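The same transformation can also be done in Python itself; a small sketch (the sample header lines are made up):

```python
# Raw "Key: value" lines as copied from the browser's Network panel
raw_headers = """\
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Accept: text/html
Accept-Language: zh-CN,zh;q=0.9"""

# Split each line on the first colon only (values may themselves contain colons)
headers = {
    key.strip(): value.strip()
    for key, value in (line.split(":", 1) for line in raw_headers.splitlines())
}

print(headers["Accept"])  # text/html
```

Splitting on the first colon only matters for values like urls or q-weighted lists that contain extra punctuation.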

post requests in Requests

Post and get are currently the two most widely used request methods on the web. The biggest difference between them is that post is used when data must be supplied to the page. For example, with Baidu Translate we type in Chinese and the page returns the translated text; the simplest case is a page that requires login, where an account and password must be provided. In short, post is used:

  1. when the web page requires login;

  2. when content must be sent to the web page.

    Post usage is mostly the same as a get request, except that the data parameter must be added. The exact meaning of the data parameter is easiest to grasp from the translation example below.

    Grammar format:

    response = requests.post("http://www.baidu.com/", data=data, headers=headers)
    

360 translation example (C31)

# -*- coding:utf-8 -*-
import requests

# build the url
asd = input("Please enter the word to translate:")
model = input("Please enter the translation mode (0: Chinese to English, 1: English to Chinese):")
url = f"https://fanyi.so.com/index/search?eng={model}&validate=&ignore_trans=0&query={asd}"  # comparing requests shows that query changes
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Pro": "fanyi"}
data = {"eng": model,
        "validate": "",
        "ignore_trans": "0",
        "query": asd}  # comparing requests shows that query changes

# send the request
response = requests.post(url, headers=header, data=data)

# # analysis
# response.encoding = "utf-8"  # set the data encoding to utf-8
# data = response.text  # decode the returned data using that encoding
# print(data)  # inspecting the data shows some unreadable escapes, which suggests json encoding

data = response.json()  # decode the returned json data
print(data)  # the data is now readable; going back to the page source, the desired text is found under ["data"]["fanyi"]
print(asd, "translates to:", data["data"]["fanyi"])  # output the translation result

# Next, to translate in the other direction, inspect the page again:
# the eng value in the url changes to 1, suggesting that 0 and 1 stand for
# Chinese-to-English and English-to-Chinese respectively, so the translation mode
# can be switched just like the translation content, e.g. by asking for it first:
# model = input("Please enter the translation mode (0: Chinese to English, 1: English to Chinese):")
# asd = input("Please enter the word to translate:")

Practical operation of Cookie and Session

While browsing websites we often need to log in, and some pages can only be accessed after logging in. After logging in you can visit the site many times in a row, but sometimes you need to log in again after a while. Other sites log you in automatically when the browser opens and stay logged in for a long time. Why is that? This is where Session and Cookie come in.

Concept

Cookie(C1)

  • Identifies the user via information stored on the client side

HTTP is a connectionless protocol: the interaction between client and server is limited to the request/response cycle, after which the connection is dropped, and on the next request the server treats the client as new. To maintain continuity between them, that is, to let the server know that a request comes from a previous user, client information must be stored somewhere.

Session(C1)

  • Known in Chinese as 会话 (session); the user's identity is determined by information recorded on the server side

The session here refers to a session. Its original meaning refers to a series of actions and news from beginning to end. For example, when making a call, a series of processes from picking up the phone to dialing to hanging up the phone can be called a Session.

Login process

What are the benefits of doing this?

The biggest advantage is that the user only has to enter the account and password once; on later visits the browser just includes the Session_id (via the Cookie header) in the request headers, and the backend can tell from the Session_id whether the user is logged in.
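In requests this bookkeeping is what requests.Session automates: cookies returned by the server are stored in the session's cookie jar and sent back on every later request made through that session. An offline sketch of the cookie-jar side (the cookie name and value are made up):

```python
import requests

session = requests.Session()

# Normally the server would set this via a Set-Cookie response header;
# here a made-up session id is planted by hand just to show where it lives
session.cookies.set("session_id", "abc123")

# Every later request made through this session carries the cookie automatically
print(dict(session.cookies))  # {'session_id': 'abc123'}
```

This is why the practical example below works with a hand-copied Cookie header instead: sending the same cookie string manually achieves what a Session would do for you.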

Cookie and Session Practical Operation (C31)

import requests

# make url
url = "https://kyfw.12306.cn/otn/leftTicket/query?leftTicketDTO.train_date=2023-07-18&leftTicketDTO.from_station=CDW&leftTicketDTO.to_station=CQW&purpose_codes=ADULT"
header = {
    "Cookie": "_uab_collina=168947424720608530945464; JSESSIONID=608409C2C0215B39C2721585BD3E4BD8; BIGipServerotn=2246574346.64545.0000; fo=undefined; BIGipServerpassport=937951498.50215.0000; guidesStatus=off; highContrastMode=defaultMode; cursorStatus=off; route=c5c62a339e7744272a54643b3be5bf64; _jc_save_toDate=2023-07-16; _jc_save_wfdc_flag=dc; _jc_save_fromStation=%u6210%u90FD,CDW; _jc_save_toStation=%u91CD%u5E86,CQW; _jc_save_fromDate=2023-07-18",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
# send the request
response = requests.get(url, headers=header)
response.encoding = "utf-8"
data = response.json() # json decoded raw data

data2 = data["data"]["result"]  # get the data containing the target information
# print(len(data2))  # number of target records


for i in data2:
    # each i represents one train's information
    temp_list = i.split('|')
    if temp_list[34] == "M0O0P0":
        if temp_list[25] != "None" and temp_list[25] != "":
            print(temp_list[3], "Business Class has tickets.", "Tickets remaining:", temp_list[25])
        else:
            print(temp_list[3], "Business Class has no tickets.")
    else:
        if temp_list[32] != "None" and temp_list[32] != "":
            print(temp_list[3], "Business Class has tickets.", "Tickets remaining:", temp_list[32])
        else:
            print(temp_list[3], "Business Class has no tickets.")

"""
Field indices when temp_list[34] == "M0O0P0":
  train number         3
  Business Class seat  25
  First Class          31
  Second Class         30

Field indices when temp_list[34] == "90M0O0":
  train number         3
  Business Class       32
  First Class          31
  Second Class         30
"""