Article directory
- urllib module
  - Understanding urllib (C1)
  - Details of urllib.request module
    - Common methods (C21)
    - UA format and acquisition method (C21)
    - urllib web page request operation (C31)
  - Details of urllib.parse module
    - Common methods (C21)
    - urlencode and quote in practice (C31)
  - Example: saving pictures with urllib (C1)
- Requests module
  - Introduction to Requests (C1)
  - pip install third-party mirror website (C4)
  - Use of Requests
    - Common methods of Requests (C21)
    - Common parameters of Requests (C21)
    - Requests response content (C21)
  - Parameter application of get request in Requests
    - Type 1: add parameters to the url link (C31)
    - Type 2: add parameters to params (C31)
    - Extension: Quickly match parameters in headers to dictionary data (C31)
  - post request in Requests
    - 360 translation example (C31)
- Practical operation of Cookie and Session
  - Concept
    - Cookie (C1)
    - Session (C1)
    - Login process
    - What are the benefits of doing this?
  - Cookie and Session Practical Operation (C31)
[Mind Map]: https://www.zhixi.com/view/96e707b1
urllib module
Understanding urllib (C1)
The urllib library is Python's built-in HTTP request library. The high-level interface provided by the urllib module makes reading data over www and ftp work much like reading local files. It consists of the following modules:
Module | Description |
---|---|
urllib.error | Exception handling module |
urllib.parse | url parsing module |
urllib.robotparser | robots.txt parsing module |
urllib.request | Request module |
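As a small illustration of one of these submodules, the sketch below feeds urllib.robotparser a hand-written robots.txt and asks whether a crawler may fetch two paths. The rules and the example.com addresses are made up for the example, and nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so no network access is needed
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the rules from an iterable of lines

allowed_public = parser.can_fetch("*", "https://example.com/index.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a real site, the rules would normally be loaded with set_url() and read() instead of being typed in by hand.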
Details of urllib.request module
Common methods (C21)
Method | Description |
---|---|
urllib.request.urlopen("URL" / Request object) | Sends a request to the website and returns the response. urlopen() itself has no parameter for request headers, so it cannot set a custom User-Agent |
urllib.request.Request("url", headers=dictionary) | Builds a Request object that carries the url together with the request headers |
response.read() | Reads the body of the server response as a byte stream |
response.read().decode("utf-8") | Decodes the byte stream into a string |
getcode() | Returns the HTTP response code |
geturl() | Returns the URL the data actually came from (useful when the request was redirected) |
UA format and acquisition method (C21)
The UA is passed as a dictionary: header = {"User-Agent": "UA string"}
How to obtain it: open the web page -> Inspect -> Network -> refresh the page -> select the request for the page address -> Headers -> scroll down to User-Agent
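A minimal sketch of how such a header dictionary ends up on a request: building a urllib.request.Request object sends nothing, so its fields can be inspected first. The UA string here is shortened and only illustrative.

```python
import urllib.request

# A shortened, illustrative UA string; copy the real one from your browser
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Building the Request object does not send anything yet
req = urllib.request.Request("https://www.baidu.com", headers=header)

print(req.full_url)                  # the target url
print(req.get_method())              # GET, because no data was supplied
print(req.get_header("User-agent"))  # urllib stores header names capitalized this way
```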
urllib web page request operation (C31)
    import urllib.request

    url = "https://www.baidu.com"
    # UA format: {"User-Agent": "UA string"}
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}

    # Build the Request object req that carries the request headers.
    # If no request headers are needed, pass the url straight to urlopen().
    req = urllib.request.Request(url, headers=header)

    # Send the request
    response = urllib.request.urlopen(req)

    # read() consumes the response body, so call it once and reuse the bytes
    html1 = response.read()  # byte stream
    print("html1:", html1)
    html2 = html1.decode()  # convert the byte stream to a string (UTF-8 by default)
    print("html2:", html2)

    # Other information, not required
    data1 = response.getcode()  # HTTP response code
    data2 = response.geturl()  # URL the data actually came from
Details of urllib.parse module
The parse module provides methods for percent-encoding Chinese (and other non-ASCII) characters as % followed by hexadecimal digits
Common methods (C21)
- urlencode(dictionary): takes a dictionary and returns a "key=value" string
- quote(string): takes a single string
urlencode and quote in practice (C31)
    import urllib.request
    import urllib.parse

    url1 = "https://www.baidu.com/s?wd=海贼王"  # "One Piece" in Chinese
    url3 = "https://www.baidu.com/s?wd=python"  # pure ASCII, works as-is

    # When the url contains Chinese characters, urlopen raises an error such as:
    # "UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)"
    try:
        req = urllib.request.urlopen(url1)
        print(req)
    except UnicodeEncodeError as e:
        print("urlopen failed:", e)

    # The solution is to escape the Chinese characters as % + hexadecimal; urllib.parse offers two ways

    # Method 1: urllib.parse.urlencode(dictionary)
    r = {"wd": "海贼王"}
    r_remake = urllib.parse.urlencode(r)  # percent-encode r as "key=value"
    print(r_remake)  # the result is a string, not a dictionary
    url = "https://www.baidu.com/s?" + r_remake  # rebuild the url
    response1 = urllib.request.urlopen(url)  # send the request

    # Method 2: urllib.parse.quote(string)
    r = "海贼王"
    r_remake = urllib.parse.quote(r)
    print(r_remake)
    url = "https://www.baidu.com/s?wd=" + r_remake
    response2 = urllib.request.urlopen(url)
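To see what the two functions actually return without sending any request, here is a small offline sketch. The parameter values are only illustrative, and unquote is included to show the reverse direction.

```python
from urllib.parse import urlencode, quote, unquote

params = {"wd": "One Piece", "ie": "utf-8"}

encoded_query = urlencode(params)  # joins the pairs with "&"; spaces become "+"
encoded_word = quote("One Piece")  # encodes a single string; spaces become "%20"
encoded_cn = quote("海贼王")        # each Chinese character becomes three %XX bytes in UTF-8

print(encoded_query)        # wd=One+Piece&ie=utf-8
print(encoded_word)         # One%20Piece
print(encoded_cn)           # %E6%B5%B7%E8%B4%BC%E7%8E%8B
print(unquote(encoded_cn))  # back to the original text
```

urlencode is the convenient choice when several parameters go into the query string, while quote suits a single value that is spliced into a url by hand.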
Example: saving pictures with urllib (C1)
    from urllib.request import urlretrieve

    # urllib.request.urlretrieve method
    # Parameter description:
    #   url:        external or local url
    #   filename:   local path to save to (if omitted, urllib saves the data to a temporary file)
    #   reporthook: callback function that can be used to display the current download progress
    #   data:       data to POST to the server
    # The method returns a two-element tuple (filename, headers): filename is the local path
    # the data was saved to, and headers is the server's response headers.
    url = "https://img0.baidu.com/it/u=1687867,2906233533&fm=253&app=138&size=w931&n=0&f=JPEG&fmt=auto?sec=1689526800&t=45c19714c9e59e9fac0a3f3f43df2fbe"
    urlretrieve(url, filename="a picture.jpg")

    # requests method
    import requests

    url = "https://img1.baidu.com/it/u=1597761366,2823600315&fm=253&fmt=auto&app=138&f=JPEG?w=889&h=500"
    response = requests.get(url)
    with open("girl.jpg", mode="wb") as f:
        f.write(response.content)
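The reporthook parameter can be sketched as follows. urlretrieve calls the hook with the block count so far, the block size in bytes, and the total size (a non-positive value when the server does not report it). The function names and the example.com address are assumptions made for this illustration.

```python
from urllib.request import urlretrieve


def progress_percent(block_num, block_size, total_size):
    """Return the download progress in percent, or None if the total size is unknown."""
    if total_size <= 0:  # the server sent no Content-Length
        return None
    downloaded = block_num * block_size
    return min(100.0, downloaded * 100 / total_size)  # the last block may overshoot


def show_progress(block_num, block_size, total_size):
    """reporthook callback: print the progress each time a block arrives."""
    percent = progress_percent(block_num, block_size, total_size)
    if percent is not None:
        print(f"downloaded {percent:.1f}%")


# Usage (hypothetical url and file name):
# urlretrieve("https://example.com/picture.jpg", filename="picture.jpg", reporthook=show_progress)
```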
Requests module
Introduction to Requests (C1)
Requests is an HTTP library written in Python, built on top of urllib3 and released under the Apache 2.0 open-source license. It is more convenient than urllib, saves a lot of work, and fully meets the needs of HTTP testing. Its design follows the idioms of PEP 20, so it is more Pythonic than urllib. More importantly, it supports Python 3.
pip install third-party mirror website (C4)
    pip install requests

    # If the download times out, just change the source, for example:
    pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple/

    # Aliyun: http://mirrors.aliyun.com/pypi/simple/
    # Douban: http://pypi.douban.com/simple/
    # Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple/
    # University of Science and Technology of China: http://pypi.mirrors.ustc.edu.cn/simple/
    # Huazhong University of Science and Technology: http://pypi.hustunique.com/
Use of Requests
Common methods of Requests(C21)
Which method to use depends on what the web page requires; check it in the browser under Network -> Headers.
- requests.get("URL")
- requests.post("URL")
- put, delete and other methods are less commonly used
Common parameters of Requests (C21)
Parameter | Description |
---|---|
url | The request url, i.e. the interface address given in the interface documentation |
params | Query parameters of the url can be placed here |
data | Request body data, sent as form parameters |
json | Request body sent as JSON, the common data format for interfaces |
headers | Request header information |
cookies | Carries the user's login information. A top-up function, for example, requires the user to be logged in, so the request has to send the cookie information |
Requests response content (C21)
Attribute | Description |
---|---|
r.encoding | Gets the current encoding. The encoding can also be found in the source code of the web page, but checking it here is easier |
r.encoding = "utf-8" | Sets the encoding directly |
r.text | Returns the content decoded with the current encoding (human-readable text) |
r.cookies | Returns the cookies |
r.headers | The server's response headers, stored in a special dictionary object whose keys are not case-sensitive; r.headers.get(key) returns None if the key does not exist |
r.status_code | Response status code |
r.json() | Requests' built-in JSON decoder; returns the parsed JSON. The returned content must be valid JSON, otherwise parsing fails and an exception is raised |
r.content | Returns the response body in byte form (binary); gzip and deflate compression are decoded automatically |
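The special behaviour of r.headers can be tried without sending a request, because the object is a requests.structures.CaseInsensitiveDict. The sketch below builds one by hand with a made-up header value.

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for r.headers; the value is made up for the example
headers = CaseInsensitiveDict({"Content-Type": "text/html; charset=utf-8"})

print(headers["content-type"])   # the lookup ignores case
print(headers.get("X-Missing"))  # .get() returns None for a missing key
```

Indexing a missing key with square brackets still raises KeyError, as with an ordinary dictionary, so .get() is the safer lookup.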
Parameter application of get request in Requests
Type 1: Add parameters to the url link (C31)
Points to note: put ? between the interface address and the parameters, write each parameter as key=value, and join multiple parameters with &
    import requests

    url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beauty&oq=%E5%9B%BE%E7%89%87%E9%AB%98%E6%B8%85&rsv_pq=bed3e6b1006dacaa&rsv_t=64d7QQCXzzGa7ZaT7lTtLojRuyL3gMlPYPnafKNs/xgiDAE8Y+Z1rNb9AsU&rqlang=cn&rsv_dl=tb&rsv_enter=1&rsv_btype=t&inputT=1775&rsv_sug3=28&rsv_sug1=21&rsv_sug7=100&rsv_sug2=0&rsv_sug4=1775"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}

    html = requests.get(url=url, headers=headers).text  # .text returns the html as readable text
    print(html)
Type 2: add parameters to params (C31)
Points to note: params takes a dictionary, so the parameters must be written as key-value pairs
    import requests

    url = "https://www.baidu.com/s?"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
    params = {
        "ie": "utf-8",
        "f": "8",
        "rsv_bp": "1",
        "tn": "baidu",
        "wd": "Beauty",
        "oq": "Picture HD",
        "rsv_pq": "bed3e6b1006dacaa",
        "rsv_t": "64d7QQCXzzGa7ZaT7lTtLojRuyL3gMlPYPnafKNs/xgiDAE8Y+Z1rNb9AsU",
        "rqlang": "cn",
        "rsv_dl": "tb",
        "rsv_enter": "1",
        "rsv_btype": "t",
        "inputT": "1775",
        "rsv_sug3": "28",
        "rsv_sug1": "21",
        "rsv_sug7": "100",
        "rsv_sug2": "0",
        "rsv_sug4": "1775",
    }

    html = requests.get(url, headers=headers, params=params).text
    print(html)
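To check how Requests turns a params dictionary into the query string, the request can be prepared without being sent. This is a sketch with a shortened parameter set; requests.Request(...).prepare() is used here only for inspection.

```python
import requests

# Shortened version of the parameters; prepare() builds the request without sending it
params = {"ie": "utf-8", "wd": "python"}
prepared = requests.Request("GET", "https://www.baidu.com/s", params=params).prepare()

print(prepared.url)  # https://www.baidu.com/s?ie=utf-8&wd=python
```

The result has the same form as a hand-written url: the ? and & separators are added automatically, and non-ASCII values are percent-encoded as well.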
Extension: Quickly match parameters in headers to dictionary data (C31)
Use regular-expression replacement in the editor:
1. Press Ctrl + R and select the text
2. Search for (.*):\s(.*)$ and replace with "$1":"$2", (note there is a comma here)
3. Tick the Regex option
4. Replace all
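The same conversion can also be done in Python instead of in the editor. The sketch below is a slight variation of the pattern above (non-greedy name, optional spaces after the colon), and the header lines are made-up sample data.

```python
import re

# Header lines as copied from the browser's Network panel (values are made up)
raw_headers = """Accept: text/html
Host: www.baidu.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)"""

# Name is everything before the first colon; value is the rest of the line
pattern = re.compile(r"^(.*?):[ \t]*(.*)$", re.MULTILINE)
headers = {name: value for name, value in pattern.findall(raw_headers)}

print(headers)
```

The resulting dictionary can be passed directly as the headers argument of a request.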
post request in Requests
POST and GET are currently the two most widely used request methods on the web. What is the biggest difference between them? POST is used when data has to be submitted to the web page. With Baidu Translate, for example, we enter Chinese text and the page returns the translation. The simplest case is a page that requires login, where the account and password must be provided.
- When the webpage requires login;
- When content has to be sent to the web page.
POST is used in much the same way as GET, except that the data parameter is added. The meaning of the data parameter is easiest to see in the translation example that follows.
Syntax:
    response = requests.post("http://www.baidu.com/", data=data, headers=headers)
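What the data parameter does can be checked by preparing a POST request without sending it. This is a sketch only: the form fields and the example.com address are assumptions made for illustration.

```python
import requests

# Made-up form fields, loosely modelled on a translation form
data = {"query": "hello", "eng": "1"}
prepared = requests.Request("POST", "https://example.com/translate", data=data).prepare()

print(prepared.method)                   # POST
print(prepared.url)                      # nothing is appended to the url
print(prepared.body)                     # query=hello&eng=1
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```

Unlike params, the dictionary travels in the request body as form data, which is why it does not show up in the address.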
360 translation example (C31)
    # -*- coding:utf-8 -*-
    import requests

    # Build the url
    asd = input("Please enter the word you want to translate: ")
    model = input("Please enter the translation mode (0: Chinese to English, 1: English to Chinese): ")
    # Comparing requests shows that query changes with the input
    url = f"https://fanyi.so.com/index/search?eng={model}&validate=&ignore_trans=0&query={asd}"
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
        "Pro": "fanyi",
    }
    data = {"eng": model, "validate": "", "ignore_trans": "0", "query": asd}

    # Send the request
    response = requests.post(url, headers=header, data=data)

    # Analysis
    # response.encoding = "utf-8"  # set the data encoding to utf-8
    # data = response.text         # decode the returned data with that encoding
    # print(data)                  # the output contains unreadable escapes, so it is probably JSON
    data = response.json()  # decode the returned JSON data
    print(data)  # readable now; in the page source the wanted data sits under ["data"]["fanyi"]
    print(asd, "translates to:", data["data"]["fanyi"])  # output the translation result

    # To support English translation as well, look at the web page again: the eng value
    # in the url changes to 1, so 0 and 1 presumably stand for Chinese to English and
    # English to Chinese. The mode can therefore be set through input(), just like the text.
Practical operation of Cookie and Session
While browsing websites, we often have to log in, and some pages can only be accessed after logging in. Once logged in, we can visit the site many times in a row, but sometimes we have to log in again after a while. Other sites log us in automatically as soon as the browser opens, and the login stays valid for a long time. Why is that? It comes down to Session and Cookie.
Concept
Cookie(C1)
- Identify the user by information recorded on the client side
HTTP is a stateless protocol. The interaction between client and server is limited to one request/response exchange, after which the connection is closed; on the next request, the server treats the client as a new one. To keep the two linked, so that the server knows the request comes from a user it has already seen, the client's information has to be stored somewhere.
Session(C1)
- Identify the user by information recorded on the server side
A session is a series of actions and messages from beginning to end. In a phone call, for example, everything from picking up the phone, through dialing, to hanging up can be called a Session.
Login process
What are the benefits of doing this?
The biggest advantage is that the user enters the account and password only once. On later visits, the Cookie in the request headers carries the Session_id, and the backend decides from the Session_id whether the user is logged in.
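The idea can be sketched in a few lines of plain Python, with a dictionary standing in for the server's session store. The function names, the cookie name session_id and the password check are all assumptions for illustration, not how any particular website implements it.

```python
import secrets
from http.cookies import SimpleCookie

sessions = {}  # server-side store: session id -> user name


def login(username, password):
    """Check the (hypothetical) credentials and return the cookie value to hand out."""
    if password != "secret":  # stand-in for a real password check
        return None
    session_id = secrets.token_hex(16)
    sessions[session_id] = username  # the Session lives on the server
    return f"session_id={session_id}"


def current_user(cookie_header):
    """Look up the user from the Cookie header that the client sends back."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session_id")
    return sessions.get(morsel.value) if morsel else None


cookie_value = login("alice", "secret")    # first request: account and password
print(current_user(cookie_value))          # later requests: only the Cookie is sent
print(current_user("session_id=unknown"))  # an unknown id counts as not logged in
```

The password is sent once; afterwards the server only needs the id from the Cookie to find the matching Session.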
Cookie and Session Practical Operation (C31)
    import requests

    # Build the url
    url = "https://kyfw.12306.cn/otn/leftTicket/query?leftTicketDTO.train_date=2023-07-18&leftTicketDTO.from_station=CDW&leftTicketDTO.to_station=CQW&purpose_codes=ADULT"
    header = {
        "Cookie": "_uab_collina=168947424720608530945464; JSESSIONID=608409C2C0215B39C2721585BD3E4BD8; BIGipServerotn=2246574346.64545.0000; fo=undefined; BIGipServerpassport=937951498.50215.0000; guidesStatus=off; highContrastMode=defaultMode; cursorStatus=off; route=c5c62a339e7744272a54643b3be5bf64; _jc_save_toDate=2023-07-16; _jc_save_wfdc_flag=dc; _jc_save_fromStation=%u6210%u90FD,CDW; _jc_save_toStation=%u91CD%u5E86,CQW; _jc_save_fromDate=2023-07-18",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    }

    # Send the request
    response = requests.get(url, headers=header)
    response.encoding = "utf-8"
    data = response.json()  # raw data decoded from JSON
    data2 = data["data"]["result"]  # the list where the target information is located
    # print(data2)

    # Field positions in each "|"-separated record:
    #   seat code M0O0P0 (index 34): train number 3, business class 25, first class 31, second class 30
    #   seat code 90M0O0 (index 34): train number 3, business class 32, first class 31, second class 30
    for i in data2:  # each i holds the information of one train
        temp_list = i.split("|")
        # The business-class column depends on the seat code at index 34
        seat = temp_list[25] if temp_list[34] == "M0O0P0" else temp_list[32]
        if seat != "无" and seat != "":  # "无" (none) marks a sold-out seat
            print(temp_list[3], "Business class has tickets,", "remaining tickets:", seat)
        else:
            print(temp_list[3], "Business class has no tickets")